[PATCH] Incremental sort (was: PoC: Partial sort)

Started by Alexander Korotkov almost 9 years ago. 359 messages.
#1 Alexander Korotkov
a.korotkov@postgrespro.ru
1 attachment(s)

Hi all!

I decided to start a new thread for this patch for the following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per a suggestion by
Robert Haas [1]. The new name much better characterizes the essence of the
algorithm.
* I think it's not a PoC anymore. The patch has received several rounds of
review and is now in pretty good shape.

The attached revision of the patch has the following changes.
* According to the review [1], two new path and plan nodes are responsible
for incremental sort: IncSortPath and IncSort, which inherit from SortPath
and Sort respectively (a sketch of the node appears after this list). That
allowed getting rid of a set of hacks with minimal code changes.
* According to the review [1] and comment [2], the previous tuple is stored
in a standalone tuple slot of SortState rather than as a bare HeapTuple.
* A new GUC parameter, enable_incsort, is introduced to control the
planner's ability to choose incremental sort.
* The postgres_fdw test with a cross join that is not pushed down is
corrected. It turned out that with incremental sort such a query becomes
profitable to push down. I changed the ORDER BY columns so that the index
couldn't be used. I think this solution is more elegant than setting
enable_incsort = off.
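
For reference, the new plan node is a minimal extension of Sort (a sketch
based on this patch; see the attached diff for the full definition):

    typedef struct IncSort
    {
        Sort    sort;       /* inherits all fields of Sort */
        int     skipCols;   /* number of leading sort columns by which the
                             * input is already ordered; only the remaining
                             * columns are sorted within each group */
    } IncSort;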

The patch also has a set of assorted code and comment improvements.

Links
1.
/messages/by-id/CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
2.
/messages/by-id/CAM3SWZQL4yD2SnDheMCGL0Q2b2oTdKUvv_L6Zg_FcGoLuwMffg@mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-1.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 0b9e3e4..408e14d
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1803,1841 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1803,1841 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c2, t2.c2
     ->  Sort
!          Output: t1.c2, t2.c2
!          Sort Key: t1.c2, t2.c2
           ->  Nested Loop
!                Output: t1.c2, t2.c2
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c2
!                      Remote SQL: SELECT c2 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c2
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c2
!                            Remote SQL: SELECT c2 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
!  c2 | c2 
! ----+----
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
!   0 |  0
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2377,2394 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2377,2397 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 56b01d0..a9f7111
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 462,469 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 462,469 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
! SELECT t1.c2, t2.c2 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c2, t2.c2 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 95afc2c..049d470
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3524,3529 ****
--- 3524,3543 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incsort" xreflabel="enable_incsort">
+       <term><varname>enable_incsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9e0a3e..5020f5c
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 92,98 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** ExplainNode(PlanState *planstate, List *
*** 974,979 ****
--- 974,982 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1504,1509 ****
--- 1507,1513 ----
  										   planstate, es);
  			break;
  		case T_Sort:
+ 		case T_IncSort:
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
*************** static void
*** 1832,1840 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1836,1850 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncSort))
+ 		skipCols = ((IncSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1850,1856 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1860,1866 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1874,1880 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1884,1890 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1930,1936 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1940,1946 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1987,1993 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1997,2003 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2000,2012 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2010,2023 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2046,2054 ****
--- 2057,2069 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2195,2206 ****
--- 2210,2230 ----
  			appendStringInfoSpaces(es->str, es->indent * 2);
  			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
  							 sortMethod, spaceType, spaceUsed);
+ 			if (sortstate->skipKeys)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str, "Sort groups: %ld\n",
+ 								 sortstate->groupsCount);
+ 			}
  		}
  		else
  		{
  			ExplainPropertyText("Sort Method", sortMethod, es);
  			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
  			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			if (sortstate->skipKeys)
+ 				ExplainPropertyLong("Sort groups: %ld",
+ 									sortstate->groupsCount, es);
  		}
  	}
  }
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index d380207..c6c3ab7
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecReScan(PlanState *node)
*** 235,240 ****
--- 235,241 ----
  			break;
  
  		case T_SortState:
+ 		case T_IncSortState:
  			ExecReScanSort((SortState *) node);
  			break;
  
*************** ExecSupportsBackwardScan(Plan *node)
*** 509,516 ****
--- 510,521 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 0dd95c6..3cc4b77
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
*************** ExecInitNode(Plan *node, EState *estate,
*** 291,296 ****
--- 291,297 ----
  			break;
  
  		case T_Sort:
+ 		case T_IncSort:
  			result = (PlanState *) ExecInitSort((Sort *) node,
  												estate, eflags);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index aa08152..aa4d8e2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 638,644 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..28272be
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,123 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check whether the first "skipCols" sort column values of two tuples are
+  * equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleTableSlot *a, TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncSort));
+ 
+ 	n = ((IncSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(SortState *node)
+ {
+ 	IncSort	   *plannode;
+ 	int			skipCols,
+ 				i;
+ 
+ 	plannode = (IncSort *) node->ss.ps.plan;
+ 	Assert(IsA(plannode, IncSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Look up the equality function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 140,155 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	int			skipCols;
+ 	TupleDesc	tupDesc;
+ 	int64		nTuples = 0;
+ 
+ 	if (IsA(plannode, IncSort))
+ 		skipCols = ((IncSort *) plannode)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,87 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
! 
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
--- 162,204 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return the next tuple from the sorted set, if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
+ 	if (skipCols == 0)
+ 	{
+ 		/* Regular case: no skip cols */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
*************** ExecSort(SortState *node)
*** 89,132 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 206,355 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 	}
! 	else
! 	{
! 		/* Incremental sort case */
! 		if (node->tuplesortstate == NULL)
! 		{
! 			/*
! 			 * We are going to process the first group of presorted data.
! 			 * Initialize support structures for cmpSortSkipCols - already
! 			 * Initialize the support structures for cmpSortSkipCols, which
! 			 * compares the already-sorted columns.
! 			prepareSkipCols(node);
  
! 			/*
! 			 * Only pass on the remaining, unsorted columns.  Skip abbreviated
! 			 * keys for incremental sort: we are unlikely to have huge groups,
! 			 * so using abbreviated keys would likely be a waste of time.
! 			 */
! 			tuplesortstate = tuplesort_begin_heap(
! 										tupDesc,
! 										plannode->numCols - skipCols,
! 										&(plannode->sortColIdx[skipCols]),
! 										&(plannode->sortOperators[skipCols]),
! 										&(plannode->collations[skipCols]),
! 										&(plannode->nullsFirst[skipCols]),
! 										work_mem,
! 										false,
! 										true);
! 			node->tuplesortstate = (void *) tuplesortstate;
! 			node->groupsCount++;
! 		}
! 		else
  		{
! 			/* Next group of presorted data */
! 			tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 			node->groupsCount++;
! 		}
  
+ 		/* Calculate remaining bound for bounded sort */
+ 		if (node->bounded)
+ 			tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 	}
+ 
+ 	/*
+ 	 * Put the next group of tuples, whose skipCols sort values are all
+ 	 * equal, into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (skipCols == 0)
+ 		{
+ 			/* Regular sort case: put all tuples into the tuplesort */
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else
+ 		{
+ 			/* Incremental sort case: put a group of presorted data into the tuplesort */
+ 			if (node->prevSlot->tts_isempty)
+ 			{
+ 				/* First tuple */
+ 				if (TupIsNull(slot))
+ 				{
+ 					node->finished = true;
+ 					break;
+ 				}
+ 				else
+ 				{
+ 					ExecCopySlot(node->prevSlot, slot);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/* Put previous tuple into tuplesort */
+ 				tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 				nTuples++;
  
! 				if (TupIsNull(slot))
! 				{
! 					node->finished = true;
! 					break;
! 				}
! 				else
! 				{
! 					bool	cmp;
! 					cmp = cmpSortSkipCols(node, node->prevSlot, slot);
  
! 					/* Replace previous tuple with current one */
! 					ExecCopySlot(node->prevSlot, slot);
  
! 					/*
! 					 * When the skipCols values are not equal, the group of
! 					 * presorted data is finished.
! 					 */
! 					if (!cmp)
! 						break;
! 				}
! 			}
! 		}
  	}
  
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 380,394 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only the current group in the
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(IsA(node, Sort) || (eflags & (EXEC_FLAG_REWIND |
+ 										 EXEC_FLAG_BACKWARD |
+ 										 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 406,417 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prevSlot = NULL;
+ 	sortstate->bound_Done = 0;
+ 	sortstate->groupsCount = 0;
+ 	sortstate->skipKeys = NULL;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecInitSort(Sort *node, EState *estate,
*** 209,214 ****
--- 446,455 ----
  	ExecAssignScanTypeFromOuterPlan(&sortstate->ss);
  	sortstate->ss.ps.ps_ProjInfo = NULL;
  
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	sortstate->prevSlot = MakeSingleTupleTableSlot(
+ 								ExecGetResultType(outerPlanState(sortstate)));
+ 
  	SO1_printf("ExecInitSort: %s\n",
  			   "sort node initialized");
  
*************** ExecEndSort(SortState *node)
*** 231,236 ****
--- 472,479 ----
  	ExecClearTuple(node->ss.ss_ScanTupleSlot);
  	/* must drop pointer to sort result tuple */
  	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
  
  	/*
  	 * Release tuplesort resources
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 561,567 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 05d8538..8c47f44
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 837,842 ****
--- 837,860 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 847,859 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 865,893 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncSort
!  */
! static IncSort *
! _copyIncSort(const IncSort *from)
! {
! 	IncSort	   *newnode = makeNode(IncSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObject(const void *from)
*** 4583,4588 ****
--- 4617,4625 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncSort:
+ 			retval = _copyIncSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b3802b4..7522cc3
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 781,792 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 781,790 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 809,814 ****
--- 807,830 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncSort(StringInfo str, const IncSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3482,3487 ****
--- 3498,3506 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncSort:
+ 				_outIncSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d2f69fe..0dcb86e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 1978,1989 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 1978,1990 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 1992,1997 ****
--- 1993,2024 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncSort
+  */
+ static IncSort *
+ _readIncSort(void)
+ {
+ 	READ_LOCALS(IncSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2520,2525 ****
--- 2547,2554 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCSORT", 7))
+ 		return_value = _readIncSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index eeacf81..22f81aa
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3097,3102 ****
--- 3097,3106 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncSortPath:
+ 			ptype = "IncSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d01630f..b98abc7
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1419,1424 ****
--- 1420,1432 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation, or an incremental
+  * sort when the data is already presorted by some of the required
+  * pathkeys.  In the latter case we estimate the number of groups the
+  * source data is divided into by the presorted pathkeys, and then estimate
+  * the cost of sorting each individual group, assuming the data is divided
+  * into groups uniformly.  Also, if a LIMIT is specified, then we have to
+  * pull from the source and sort only some of the groups.
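+  * For example (hypothetical numbers, purely illustrative): with 100
+  * equal-sized groups and a LIMIT that needs 10% of the tuples, one group
+  * sort is charged to startup cost and about 100 * 0.10 - 1 = 9 further
+  * group sorts are charged to run cost (see the group_cost computation
+  * below).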
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1445,1451 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1453,1460 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1461,1479 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1470,1497 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1499,1511 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1517,1566 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1515,1521 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1570,1576 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1526,1535 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1581,1590 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1537,1550 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1592,1617 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can return any output.
+ 	 * Sorting the rest of the groups is required to return all the remaining
+ 	 * tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2300,2305 ****
--- 2367,2374 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2326,2331 ****
--- 2395,2402 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 1065b31..653e4e9
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 397,408 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are satisfied only partially, we will have to perform
!  *	  an incremental sort to satisfy them completely.  Since incremental
!  *	  sort consumes data by presorted groups, we will have to consume more
!  *	  input data than in the case of a fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1461,1486 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1494,1535 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * rest can be satisfied by incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any leading common pathkeys are useful for ordering, because we
! 		 * can use incremental sort for the remaining keys.
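! 		 *
! 		 * For example (hypothetical keys): with query_pathkeys (a, b, c)
! 		 * and a path sorted by (a), we return 1; incremental sort can
! 		 * handle (b, c).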
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1496,1502 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1545,1551 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 997bdcf..f103b04
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 227,233 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 227,233 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 242,251 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 242,253 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 423,428 ****
--- 425,431 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1068,1073 ****
--- 1071,1077 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1102,1110 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1106,1116 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1535,1540 ****
--- 1541,1547 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1544,1550 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1551,1561 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1790,1796 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1801,1808 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3624,3631 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3636,3649 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3636,3643 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3654,3667 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4692,4698 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4716,4723 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5214,5226 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5239,5269 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use a regular sort node when enable_incsort = false */
+ 	if (!enable_incsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncSort    *incSort;
+ 
+ 		incSort = makeNode(IncSort);
+ 		node = &incSort->sort;
+ 		incSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5552,5558 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5595,5601 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5572,5578 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5615,5621 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5615,5621 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5658,5664 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5636,5642 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5679,5686 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5669,5675 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5713,5719 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6317,6322 ****
--- 6361,6367 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 3d33d46..557f885
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3497,3510 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3497,3510 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3577,3590 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3577,3590 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4240,4252 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4240,4252 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 5325,5332 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 5325,5333 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index be267b9..7835cc4
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 610,615 ****
--- 610,616 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 7954c44..4df783e
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2693,2698 ****
--- 2693,2699 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncSort:
  		case T_Unique:
  		case T_Gather:
  		case T_SetOp:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 06e843d..f3b9717
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 3248296..2777aca
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1293,1304 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1293,1305 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1312,1317 ****
--- 1313,1320 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1548,1554 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1551,1558 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2399,2407 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2403,2433 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2415,2421 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2441,2449 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2687,2693 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2715,2722 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index f9f18f2..9607889
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index d14f0f9..a8fd978
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3521,3526 ****
--- 3521,3562 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+  * 							  is divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * that the first i pathkeys divide the dataset into.  This is a convenience
+  * wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 5d8fb2e..46a2c16
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 857,862 ****
--- 857,871 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	TupSortStatus maxStatus;	/* maximum status reached across sort groups */
+ 	int64		maxMem;			/* maximum amount of memory used across
+ 								   sort groups */
+ 	bool		maxMemOnDisk;	/* does maxMem measure on-disk space? */
+ 	MemoryContext maincontext;	/* context that survives tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 715,721 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 774,787 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 823,829 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 854,860 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 945,951 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1018,1024 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1055,1061 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1230,1327 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	memUsed;
! 	bool	memUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		memUsedOnDisk = true;
! 		memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		memUsedOnDisk = false;
! 		memUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	state->maxStatus = Max(state->maxStatus, state->status);
! 	if (memUsed > state->maxMem)
! 	{
! 		state->maxMem = memUsed;
! 		state->maxMemOnDisk = memUsedOnDisk;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This avoids recreating the tuplesort (and saves
!  *	resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3327,3341 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxMemOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxMem + 1023) / 1024;
  
! 	switch (state->maxStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 9f41bab..c95cb42
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1814,1819 ****
--- 1814,1833 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1825,1833 ****
--- 1839,1852 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* have we finished fetching tuples
+ 								   from the outer node? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 95dd8ba..7a5dcf5
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 71,76 ****
--- 71,77 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 120,125 ****
--- 121,127 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 249,254 ****
--- 251,257 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index f72f7a8..6b96535
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 699,704 ****
--- 699,715 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index f7ac6f6..2c56105
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1331,1336 ****
--- 1331,1346 ----
  } SortPath;
  
  /*
+  * IncSortPath
+  */
+ typedef struct IncSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 72200fa..c26ef9a
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 96,104 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ebda308..3271203
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 180,185 ****
--- 180,186 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
*************** extern List *select_outer_pathkeys_for_m
*** 216,221 ****
--- 217,223 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Incremental Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index a8c8b28..11d697e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1448,1453 ****
--- 1448,1454 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1588,1596 ****
--- 1589,1633 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index d48abd7..f6a99d1
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 75,80 ****
--- 75,81 ----
   enable_bitmapscan    | on
   enable_hashagg       | on
   enable_hashjoin      | on
+  enable_incsort       | on
   enable_indexonlyscan | on
   enable_indexscan     | on
   enable_material      | on
*************** select name, setting from pg_settings wh
*** 83,89 ****
   enable_seqscan       | on
   enable_sort          | on
   enable_tidscan       | on
! (11 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 84,90 ----
   enable_seqscan       | on
   enable_sort          | on
   enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index a8b7eb1..5cf4426
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 498,503 ****
--- 498,504 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 559,567 ****
--- 560,585 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#2Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#1)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I decided to start new thread for this patch for following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per suggestion by
Robert Haas [1]. New name much better characterizes the essence of
algorithm.
* I think it's not PoC anymore. Patch received several rounds of review
and now it's in the pretty good shape.

Attached revision of patch has following changes.
* According to review [1], two new path and plan nodes are responsible for
incremental sort: IncSortPath and IncSort which are inherited from SortPath
and Sort correspondingly. That allowed to get rid of set of hacks with
minimal code changes.
* According to review [1] and comment [2], previous tuple is stored in
standalone tuple slot of SortState rather than just HeapTuple.
* New GUC parameter enable_incsort is introduced to control planner ability
to choose incremental sort.
* Test of postgres_fdw with not pushed down cross join is corrected. It
appeared that with incremental sort such query is profitable to push down.
I changed ORDER BY columns so that index couldn't be used. I think this
solution is more elegant than setting enable_incsort = off.

I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea. Is that sort
incrementing, incremental, incredible, incautious, or incorporated?

The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.

I don't know of any existing cases where the same executor file is
responsible for executing more than one type of executor node.
I was imagining a more-complete separation of the new executor node.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Robert Haas (#2)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Feb 19, 2017 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I decided to start new thread for this patch for following two reasons.
* It's renamed from "Partial sort" to "Incremental sort" per suggestion by
Robert Haas [1]. New name much better characterizes the essence of
algorithm.
* I think it's not PoC anymore. Patch received several rounds of review
and now it's in the pretty good shape.

Attached revision of patch has following changes.
* According to review [1], two new path and plan nodes are responsible for
incremental sort: IncSortPath and IncSort which are inherited from SortPath
and Sort correspondingly. That allowed to get rid of set of hacks with
minimal code changes.
* According to review [1] and comment [2], previous tuple is stored in
standalone tuple slot of SortState rather than just HeapTuple.
* New GUC parameter enable_incsort is introduced to control planner
ability to choose incremental sort.
* Test of postgres_fdw with not pushed down cross join is corrected. It
appeared that with incremental sort such query is profitable to push down.
I changed ORDER BY columns so that index couldn't be used. I think this
solution is more elegant than setting enable_incsort = off.

I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea. Is that sort
incrementing, incremental, incredible, incautious, or incorporated?

I'm not so sure about the naming of GUCs, because we already
have enable_hashagg instead of enable_hashaggregate, enable_material
instead of enable_materialize, and enable_nestloop instead
of enable_nestedloop. But anyway, I renamed "inc" to "Incremental"
everywhere in the code, and I renamed the enable_incsort GUC to
enable_incrementalsort as well, since I don't have a strong opinion here.

The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.

Yes, I've changed the regression test query, as I mentioned in the previous
message. With the incremental sort feature, the original query can no longer
serve as an example of a non-pushed-down join. However, you're right that a
query which returns all zeroes doesn't look good there either. So I changed
the query to order by column "c3", which is a non-indexed textual
representation of "c1".

I don't know of any existing cases where the same executor file is
responsible for executing more than one type of executor node.
I was imagining a more-complete separation of the new executor node.

OK, I put incremental sort into a separate executor node.
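
For illustration, the core of the new node's tuple-fetching loop works
roughly like the sketch below. This is a simplified outline rather than the
literal patch code: cmpSortSkipCols stands in for a helper that compares the
presorted prefix ("skip key") columns of two slots, and bounded-sort
handling, instrumentation, and rescan logic are omitted.

Tuplesortstate *tuplesortstate = (Tuplesortstate *) node->tuplesortstate;

/*
 * Accumulate one group of tuples whose skip-key (presorted prefix)
 * columns are all equal; only the remaining columns need sorting.
 */
for (;;)
{
    TupleTableSlot *slot = ExecProcNode(outerPlanState(node));

    if (TupIsNull(slot))
    {
        node->finished = true;      /* outer node is exhausted */
        break;
    }

    /*
     * A change in the skip keys ends the current group (the boundary
     * tuple itself is carried over to the next group; not shown here).
     */
    if (!TupIsNull(node->prevSlot) &&
        !cmpSortSkipCols(node, node->prevSlot, slot))
        break;

    tuplesort_puttupleslot(tuplesortstate, slot);
    ExecCopySlot(node->prevSlot, slot);
}

tuplesort_performsort(tuplesortstate);
/* ... drain the sorted group with tuplesort_gettupleslot() ... */

/* Keep the sort machinery, drop the data: ready for the next group. */
tuplesort_reset(tuplesortstate);
node->groupsCount++;

Because each group is typically much smaller than the whole input, each
tuplesort_reset keeps the sort within memory more often, which is where the
startup-cost advantage of the incremental sort comes from.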

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-2.patchapplication/octet-stream; name=incremental-sort-2.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 0b9e3e4..2f8aa6f
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1803,1841 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1803,1841 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2377,2394 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2377,2397 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 56b01d0..8a61277
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 462,469 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 462,469 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 1b390a2..cda89a3
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3524,3529 ****
--- 3524,3543 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9e0a3e..e1fe3b7
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 79,84 ****
--- 79,86 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 974,979 ****
--- 978,986 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1507,1512 ****
--- 1514,1525 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1832,1846 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1845,1882 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1850,1856 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1886,1892 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1874,1880 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1910,1916 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1930,1936 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1966,1972 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1987,1993 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2023,2029 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2000,2012 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2036,2049 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2046,2054 ****
--- 2083,2095 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2206,2211 ****
--- 2247,2289 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 2a2b7eb..d80883d
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execGroup
*** 23,30 ****
         nodeLimit.o nodeLockRows.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
  
--- 23,31 ----
         nodeLimit.o nodeLockRows.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o \
!        nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
  
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index d380207..16df1b2
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 238,243 ****
--- 239,248 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 509,516 ****
--- 514,525 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index ef6f35a..5c77ab1
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 92,97 ****
--- 92,98 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 295,300 ****
--- 296,306 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 505,510 ****
--- 511,520 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 761,766 ****
--- 771,780 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index aa08152..aa4d8e2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 638,644 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04576c6
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,485 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check whether the values of the first "skipCols" sort columns are equal
+  * between two tuples.
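+  *
+  * For example, with skipCols = 2 and sort columns (a, b, c), return true
+  * iff both tuples agree on a and b; column c is not examined.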
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the underlying equality function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, perform an incremental sort: fetch
+  *		a group of tuples whose prefix sort columns are all equal, then sort
+  *		that group using tuplesort.  This avoids sorting the whole dataset
+  *		at once.  Besides taking less memory and being faster, it allows the
+  *		node to start returning tuples before the full dataset has been
+  *		fetched from the outer subtree.
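+  *
+  *		For example, for ORDER BY a, b with the input already sorted by a,
+  *		the stream (1,5) (1,2) (2,9) (2,4) is consumed as two groups: the
+  *		a = 1 group is sorted by the remaining key b and emitted as
+  *		(1,2) (1,5) while only one a = 2 tuple (the group boundary) has been
+  *		read ahead; the tuplesort is then reset for the a = 2 group.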
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	int					skipCols;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	skipCols = plannode->skipCols;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * Otherwise, read the next group of tuples from the outer plan and pass
+ 	 * them to tuplesort.c.  Subsequent calls fetch tuples from the tuplesort
+ 	 * until the current group is exhausted.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures for cmpSortSkipCols, which
+ 		 * compares the already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass only the remaining, unsorted columns to the tuplesort.
+ 		 * Also skip the use of abbreviated keys: incremental sort is
+ 		 * unlikely to see huge groups, so building abbreviated keys would
+ 		 * most likely be a waste of time.
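+ 		 *
+ 		 * For instance, with sort columns (a, b, c) and skipCols = 1, the
+ 		 * tuplesort below sorts by (b, c) only.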
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols - skipCols,
+ 									&(plannode->sort.sortColIdx[skipCols]),
+ 									&(plannode->sort.sortOperators[skipCols]),
+ 									&(plannode->sort.collations[skipCols]),
+ 									&(plannode->sort.nullsFirst[skipCols]),
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/*
+ 	 * Fetch the next group of tuples, i.e. tuples whose skipCols sort
+ 	 * values are all equal, and put them into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		/* Put the next group of presorted data into the tuplesort */
+ 		if (node->prevSlot->tts_isempty)
+ 		{
+ 			/* First tuple */
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->prevSlot, slot);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 			nTuples++;
+ 
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				bool	cmp;
+ 				cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+ 
+ 				/* Replace previous tuple with current one */
+ 				ExecCopySlot(node->prevSlot, slot);
+ 
+ 				/*
+ 				 * If the skipCols values differ, the current group of
+ 				 * presorted data is finished.
+ 				 */
+ 				if (!cmp)
+ 					break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done by the number of tuples we've actually sorted.
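+ 	 * For example, with LIMIT 10 (bound = 10), if the first group held 7
+ 	 * tuples then bound_Done becomes 7, and the tuplesort for the next
+ 	 * group will be bounded to the remaining 3 tuples.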
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->prevSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * Incremental sort keeps only the current group in the tuplesort, so
+ 	 * we can never just rewind the sorted output.  We always forget the
+ 	 * previous sort results and re-read (and re-sort) the subplan.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 05d8538..1288789
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 837,842 ****
--- 837,860 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 847,859 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 865,893 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObject(const void *from)
*** 4583,4588 ****
--- 4617,4625 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b3802b4..10cec96
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 781,792 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 781,790 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 809,814 ****
--- 807,830 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3482,3487 ****
--- 3498,3506 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d2f69fe..c1b084e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 1978,1989 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 1978,1990 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 1992,1997 ****
--- 1993,2024 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2520,2525 ****
--- 2547,2554 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 15))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 633b5c1..bdbd8bf
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3097,3102 ****
--- 3097,3106 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index c138f57..a131c10
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1418,1423 ****
--- 1419,1431 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when the data is already presorted by some of the required pathkeys.  In
+  * the latter case we estimate the number of groups into which the presorted
+  * pathkeys divide the source data, and then estimate the cost of sorting
+  * each individual group, assuming the data is divided among the groups
+  * uniformly.  Also, if a LIMIT is specified, we have to pull from the
+  * source and sort only some of the groups.
+  *
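+  * As a rough illustration of the benefit: sorting n tuples split evenly
+  * into g groups costs about g * (n/g) * log2(n/g) comparisons in total,
+  * versus n * log2(n) for a single full sort, and only the first group has
+  * to be sorted before the first tuple can be returned.
+  *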
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1444,1450 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1452,1459 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1460,1478 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1469,1496 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1498,1510 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1516,1565 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the presorted keys divide the dataset into.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are all equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1514,1520 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1569,1575 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1525,1534 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1580,1589 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1536,1549 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1591,1616 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can start returning
+ 	 * output.  Sorting the rest of the groups is required to return all
+ 	 * the other tuples.
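+ 	 *
+ 	 * For example, with num_groups = 100 and no LIMIT (output_tuples equal
+ 	 * to tuples), startup_cost absorbs one group_cost while run_cost gets
+ 	 * the other 99.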
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2297,2302 ****
--- 2364,2371 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2323,2328 ****
--- 2392,2399 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 1065b31..9b06c6a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
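+  *    For example, for keys1 = (a, b, c) and keys2 = (a, b, d) the result
+  *    is 2.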
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 397,408 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are only partially satisfied, an incremental sort is
!  *	  needed to satisfy them completely.  Since incremental sort consumes
!  *	  its input in presorted groups, it has to consume more data than a
!  *	  fully presorted path would.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1461,1486 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1494,1535 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of path keys in common, or 0 if there are none. Any
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering, because the
! 		 * remaining keys can be handled by an incremental sort.
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only if
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1496,1502 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1545,1551 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 1e953b4..5625f2a
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 227,233 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 227,233 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 242,251 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 242,253 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 423,428 ****
--- 425,431 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1067,1072 ****
--- 1070,1076 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1101,1109 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1105,1115 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1534,1539 ****
--- 1540,1546 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1543,1549 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1550,1560 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1789,1795 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1800,1807 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3621,3628 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3633,3646 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3633,3640 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3651,3664 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4686,4692 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4710,4717 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5208,5220 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5233,5263 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5546,5552 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5589,5595 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5566,5572 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5609,5615 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5609,5615 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5652,5658 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5630,5636 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5673,5680 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5663,5669 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5707,5713 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6311,6316 ****
--- 6355,6361 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index ca0ae78..6e4f223
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3497,3510 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3497,3510 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3577,3590 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3577,3590 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4239,4251 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4239,4251 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 5324,5331 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 5324,5332 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 07ddbcf..0534ac8
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 608,613 ****
--- 608,614 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 3eb2bb7..69ad4d3
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2692,2697 ****
--- 2692,2698 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_SetOp:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 1389db1..0972d4b
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 3248296..1faf100
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1293,1304 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1293,1305 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1312,1317 ****
--- 1313,1320 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1548,1554 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1551,1558 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2399,2407 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2403,2433 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2415,2421 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2441,2449 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2687,2693 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2715,2722 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index f9f18f2..9607889
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 8b05e8f..ab66784
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3521,3526 ****
--- 3521,3562 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * the first i pathkeys divide the dataset into.  This is a convenience
+  * wrapper around estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 0707f66..9e00658
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 857,862 ****
--- 857,871 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	TupSortStatus maxStatus;	/* maximum status reached across sort groups */
+ 	int64		maxMem;			/* maximum amount of memory used across
+ 								   sort groups */
+ 	bool		maxMemOnDisk;	/* does maxMem refer to on-disk space? */
+ 	MemoryContext maincontext;	/* context surviving tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The contents
+ 	 * of this context are deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 715,721 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 774,787 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 823,829 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 854,860 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 945,951 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1018,1024 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1055,1061 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing a tuplesort's resources.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1230,1327 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	memUsed;
! 	bool	memUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		memUsedOnDisk = true;
! 		memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		memUsedOnDisk = false;
! 		memUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	state->maxStatus = Max(state->maxStatus, state->status);
! 	if (memUsed > state->maxMem)
! 	{
! 		state->maxMem = memUsed;
! 		state->maxMemOnDisk = memUsedOnDisk;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Release all the data in the tuplesort, but keep the
!  *	meta-information.  After tuplesort_reset, the tuplesort is ready to start
!  *	a new sort.  This avoids recreating the tuplesort (and saves resources)
!  *	when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3327,3341 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxMemOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxMem + 1023) / 1024;
  
! 	switch (state->maxStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 6332ea0..0d63c65
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1817,1822 ****
--- 1817,1836 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData holds the information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1833,1838 ****
--- 1847,1872 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* done fetching tuples from the
+ 								   outer node? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 95dd8ba..24b49a7
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 71,76 ****
--- 71,77 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 120,125 ****
--- 121,127 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 249,254 ****
--- 251,257 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index f72f7a8..2a776ee
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 699,704 ****
--- 699,715 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index f7ac6f6..b0ab815
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1331,1336 ****
--- 1331,1346 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 72200fa..09067f4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 96,104 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ebda308..3271203
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 180,185 ****
--- 180,186 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
*************** extern List *select_outer_pathkeys_for_m
*** 216,221 ****
--- 217,223 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Incremental Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index a8c8b28..2925e55
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1448,1453 ****
--- 1448,1454 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1588,1596 ****
--- 1589,1633 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index d48abd7..119f7d5
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,89 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (11 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,90 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index a8b7eb1..39dd786
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 498,503 ****
--- 498,504 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 559,567 ****
--- 560,585 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#4Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Alexander Korotkov (#3)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

This patch needs to be rebased.

1. It fails to apply, as below:

patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql

2. Also, there are compilation errors due to new commits.

-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
gm_plan->nullsFirst);
^
createplan.c:235:14: note: expected ‘int’ but argument is of type ‘AttrNumber *’
static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
gm_plan->nullsFirst);

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Mithun Cy (#4)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Dear Mithun,

On Mon, Mar 20, 2017 at 10:01 AM, Mithun Cy <mithun.cy@enterprisedb.com>
wrote:

On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

This patch needs to be rebased.

1. It fails to apply, as below:

patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql

2. Also, there are compilation errors due to new commits.

-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
gm_plan->nullsFirst);
^
createplan.c:235:14: note: expected ‘int’ but argument is of type
‘AttrNumber *’
static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
gm_plan->nullsFirst);

Thank you for the report.
Please find the rebased patch attached.
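
For the record, the warnings above come from the new make_sort()
signature: create_gather_merge_plan() must now thread the extra
skipCols argument through. A minimal sketch of the expected call-site
fix (an illustration, not quoted verbatim from the attached patch;
Gather Merge input is not treated as presorted, so skipCols is 0
there):

    /* sketch: pass skipCols = 0, no incremental sort below Gather Merge */
    subplan = (Plan *) make_sort(subplan, gm_plan->numCols, 0,
                                 gm_plan->sortColIdx,
                                 gm_plan->sortOperators,
                                 gm_plan->collations,
                                 gm_plan->nullsFirst);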

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-3.patchapplication/octet-stream; name=incremental-sort-3.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 059c5c3..185a0da
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1913,1951 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2487,2507 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 8f3edc1..a13d556
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index b379b67..3dfe6a5
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3538,3543 ****
--- 3538,3557 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
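+         Incremental sort takes advantage of input that is already sorted
+         by a prefix of the required sort keys, sorting only the remaining
+         keys within each group of equal prefix values.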
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index c9b55ea..036a410
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 79,84 ****
--- 79,86 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 993,998 ****
--- 997,1005 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1561,1566 ****
--- 1568,1579 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1886,1900 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1899,1936 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1904,1910 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1940,1946 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1928,1934 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1964,1970 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1984,1990 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2020,2026 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2041,2047 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2077,2083 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2054,2066 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2090,2103 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2100,2108 ****
--- 2137,2149 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2260,2265 ****
--- 2301,2343 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index d281906..1b97d1c
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execGroup
*** 23,30 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
         nodeTableFuncscan.o
--- 23,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o \
!        nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
         nodeTableFuncscan.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 5d59f95..e04175a
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 243,248 ****
--- 244,253 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 514,521 ****
--- 519,530 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
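+ 			/* we keep only the current sort group, so cannot scan backward */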
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 80c77ad..1fa1de4
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 302,307 ****
--- 303,313 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 521,526 ****
--- 527,536 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 789,794 ****
--- 799,808 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 3207ee4..aa9dfcc
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 559,564 ****
--- 559,565 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 637,643 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 638,644 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04576c6
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,485 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the equality function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming the outer subtree returns tuples presorted by some prefix
+  *		of the target sort columns, performs an incremental sort.  It
+  *		fetches groups of tuples whose prefix sort columns are equal and
+  *		sorts each group using tuplesort.  This avoids sorting the whole
+  *		dataset at once.  Besides taking less memory and being faster, it
+  *		allows tuples to be returned before the full dataset has been
+  *		fetched from the outer subtree.
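+  *
+  *		For example, if the input is sorted by (a) and the requested
+  *		ordering is (a, b), we accumulate consecutive tuples with equal
+  *		"a" and sort each such group by "b".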
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	int					skipCols;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	skipCols = plannode->skipCols;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * Read the next group of tuples from the outer plan and pass them to
+ 	 * tuplesort.c.  Within a group, subsequent calls just fetch tuples from
+ 	 * the tuplesort.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures that cmpSortSkipCols uses to
+ 		 * compare the already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Only pass on the remaining columns that are unsorted.  Skip
+ 		 * abbreviated keys for incremental sort: we are unlikely to have
+ 		 * huge groups, so using abbreviated keys would likely be a waste
+ 		 * of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols - skipCols,
+ 									&(plannode->sort.sortColIdx[skipCols]),
+ 									&(plannode->sort.sortOperators[skipCols]),
+ 									&(plannode->sort.collations[skipCols]),
+ 									&(plannode->sort.nullsFirst[skipCols]),
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/*
+ 	 * Put the next group of tuples, whose skipCols sort values are all
+ 	 * equal, into the tuplesort.
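+ 	 *
+ 	 * We read one tuple ahead: the tuple held in prevSlot is passed to the
+ 	 * tuplesort, the freshly fetched tuple is compared against it, and on a
+ 	 * group boundary the new tuple is kept in prevSlot for the next call.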
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		/* Put next group of presorted data to the tuplesort */
+ 		if (node->prevSlot->tts_isempty)
+ 		{
+ 			/* First tuple */
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->prevSlot, slot);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 			nTuples++;
+ 
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				bool	cmp;
+ 				cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+ 
+ 				/* Replace previous tuple with current one */
+ 				ExecCopySlot(node->prevSlot, slot);
+ 
+ 				/*
+ 				 * If the skipCols values are not equal, the current group
+ 				 * of presorted data is finished.
+ 				 */
+ 				if (!cmp)
+ 					break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->prevSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * We always forget the previous sort results and re-read the subplan:
+ 	 * incremental sort keeps only the current group in the tuplesortstate,
+ 	 * so the sorted output cannot simply be rewound and rescanned.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 25fd051..f82f620
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 885,890 ****
--- 885,908 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 895,907 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 913,941 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObject(const void *from)
*** 4686,4691 ****
--- 4720,4728 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 7418fbe..d78fd02
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 822,833 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 822,831 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 850,855 ****
--- 848,871 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3591,3596 ****
--- 3607,3615 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index d3bbc02..65f7ff0
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2021,2032 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2021,2033 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2035,2040 ****
--- 2036,2067 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2587,2592 ****
--- 2614,2621 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 15))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 43bfd23..a9c9005
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3209,3214 ****
--- 3209,3218 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index a129d1e..5af59f1
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1563,1568 ****
--- 1564,1576 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * Sort can be either a full sort of the relation or an incremental sort
+  * when the data is already presorted by some of the required pathkeys.  In
+  * the latter case we estimate the number of groups the input is divided
+  * into by the presorted pathkeys, and then estimate the cost of sorting
+  * each individual group, assuming the data is divided among the groups
+  * uniformly.  Also, if a LIMIT is specified, we only have to fetch and
+  * sort some of the groups.
+  *
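+  * For example, if 1,000,000 input tuples fall into 1,000 groups with equal
+  * presorted keys, we charge roughly the cost of 1,000 sorts of 1,000 tuples
+  * each rather than the cost of a single sort of 1,000,000 tuples.
+  *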
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1589,1595 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1597,1604 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1605,1623 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1614,1641 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1643,1655 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1661,1710 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the input is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group of tuples whose
! 	 * presorted keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1659,1665 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1714,1720 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1670,1679 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1725,1734 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1694 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1736,1761 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can output anything,
+ 	 * so that cost is charged to startup_cost.  Sorting the rest of the
+ 	 * groups is required to return all the other tuples.
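+ 	 *
+ 	 * With a bounded sort, only about num_groups * (output_tuples / tuples)
+ 	 * groups need to be fetched and sorted, so only that share of the
+ 	 * per-group cost is charged to run_cost.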
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2447,2452 ****
--- 2514,2521 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2473,2478 ****
--- 2542,2549 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
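+  *    For example, for (a, b, c) and (a, b, d) the result is 2.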
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path
!  *	  exists.  If the pathkeys are only partially satisfied, an incremental
!  *	  sort is needed to satisfy them completely.  Since incremental sort
!  *	  consumes input by presorted groups, it has to consume more input than
!  *	  a fully presorted path would.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys; the
!  * remaining ones can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of common pathkeys, or 0 if there are none.
! 		 * Any common prefix of the pathkeys is useful for ordering, because
! 		 * the remaining keys can be handled by an incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only if
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 89e1946..f80740e
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 232,238 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 232,238 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 247,256 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 247,258 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 431,436 ****
--- 433,439 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1087,1092 ****
--- 1090,1096 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1121,1129 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1125,1135 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1470,1475 ****
--- 1476,1482 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1496,1507 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1503,1518 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1609,1614 ****
--- 1620,1626 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1618,1624 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1630,1640 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1864,1870 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1880,1887 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3742,3749 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3759,3772 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3754,3761 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3777,3790 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4807,4813 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4836,4843 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5366,5378 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5396,5426 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5704,5710 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5752,5758 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5724,5730 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5772,5778 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5767,5773 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5815,5821 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5788,5794 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5836,5843 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5821,5827 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5870,5876 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6469,6474 ****
--- 6518,6524 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 02286d9..b9f8997
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3508,3521 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3508,3521 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3588,3601 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3588,3601 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4323,4335 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4323,4335 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 5458,5465 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 5458,5466 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 5f3027e..71fb394
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 623,628 ****
--- 623,629 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 6fa6540..2b7f081
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2698,2703 ****
--- 2698,2704 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 1389db1..0972d4b
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 8ce772d..e280f4b
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1294,1305 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1294,1306 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1313,1318 ****
--- 1314,1321 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1549,1555 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1552,1559 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1641,1646 ****
--- 1645,1651 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost			 input_startup_cost = 0;
  	Cost			 input_total_cost = 0;
+ 	int				 n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1663 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1662,1670 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1671,1676 ****
--- 1678,1685 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2486,2494 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2495,2525 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2502,2508 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2533,2541 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2774,2780 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2807,2814 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index bb9a544..735bd15
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3522,3527 ****
--- 3522,3563 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+  * 							  is divided into by the given pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of
+  * groups the first i pathkeys divide the dataset into.  This is a
+  * convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 4feb26a..d4f5555
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..af93ae4
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,291 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	TupSortStatus maxStatus;	/* maximum status reached between sort groups */
+ 	int64		maxMem;			/* maximum amount of memory used between
+ 								   sort groups */
+ 	bool		maxMemOnDisk;	/* is maxMem value for on-disk memory */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 638,646 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 675,704 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 715,721 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 733,739 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 774,787 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 823,829 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 854,860 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 945,951 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1018,1024 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1055,1061 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1166,1177 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1230,1327 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	memUsed;
! 	bool	memUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		memUsedOnDisk = true;
! 		memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		memUsedOnDisk = false;
! 		memUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	state->maxStatus = Max(state->maxStatus, state->status);
! 	if (memUsed > state->maxMem)
! 	{
! 		state->maxMem = memUsed;
! 		state->maxMemOnDisk = memUsedOnDisk;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This avoids recreating the tuplesort (and thus saves
!  *	resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3327,3341 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxMemOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxMem + 1023) / 1024;
  
! 	switch (state->maxStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index f856f60..347b551
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1899,1904 ****
--- 1899,1918 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1915,1920 ****
--- 1929,1954 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 2bc7a5d..22b2c46
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 72,77 ****
--- 72,78 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 123,128 ****
--- 124,130 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 255,260 ****
--- 257,263 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index b880dc1..990585e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 711,716 ****
--- 711,727 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 05d6f07..b386697
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1344,1349 ****
--- 1344,1359 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index d9a9b12..06827e3
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 100,107 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 101,109 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 0ff8062..3ad5eb3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 996,1010 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 996,1013 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Incremental Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6494b20..c3e2609
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1454,1459 ****
--- 1454,1460 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1594,1602 ****
--- 1595,1639 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index e3e9e34..0bf3c01
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 499,504 ****
--- 499,505 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 560,568 ****
--- 561,586 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Korotkov (#5)
Re: [PATCH] Incremental sort

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please find the rebased patch in the attachment.

I had a quick look at this.

* I'd love to have an explanation of what an Incremental Sort is, in the
file header comment for nodeIncrementalSort.c.
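
Roughly the sort of thing such a header could say (the wording here is
only a sketch, not text from the patch):

/*
 * nodeIncrementalSort.c
 *	  Incremental sort is an optimized variant of a multi-key sort for
 *	  the case when the input is already presorted by a prefix of the
 *	  required sort keys ("skip keys").  The input is divided into
 *	  groups of tuples with equal skip-key values; the remaining keys
 *	  need to be ordered only within each group, so each group is
 *	  sorted separately with its own, typically much smaller,
 *	  tuplesort.  This reduces the memory needed for sorting and lets
 *	  tuples be returned before the whole input has been read.
 */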

* I didn't understand the maxMem stuff in tuplesort.c. The comments
there use the phrase "on-disk memory", which seems like an oxymoron.
Also, "maximum status" seems weird, as it assumes that there's a natural
order to the states.

* In the below example, the incremental sort is significantly slower
than the Seq Scan + Sort you get otherwise:

create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';

postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=138655.68..138655.69 rows=1 width=8)
   ->  Incremental Sort  (cost=610.99..124870.38 rows=1102824 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)

Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 387.091 ms

postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 231.668 ms

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
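
For illustration, here is a minimal, self-contained sketch of that
single-tuple-group fast path, outside the executor; the Row type, the
function names, and the grouping loop are hypothetical stand-ins for the
patch's slot/tuplesort machinery:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical row type: the input is presorted by 'a' (the skip key). */
typedef struct { int a; int c; } Row;

static int cmp_c(const void *p1, const void *p2)
{
	const Row *r1 = p1;
	const Row *r2 = p2;

	return (r1->c > r2->c) - (r1->c < r2->c);
}

/* Emit rows ordered by (a, c), sorting only within each 'a' group. */
static void incremental_sort(Row *rows, int n)
{
	int		start = 0;

	while (start < n)
	{
		int		end = start + 1;

		/* Find the end of the current group of equal skip-key values. */
		while (end < n && rows[end].a == rows[start].a)
			end++;

		/*
		 * Single-tuple groups are emitted directly; only groups with two
		 * or more members pay for the sort.  In the executor this is
		 * where the ExecCopySlot/tuplesort round trip would be skipped.
		 */
		if (end - start > 1)
			qsort(rows + start, end - start, sizeof(Row), cmp_c);

		for (int i = start; i < end; i++)
			printf("%d %d\n", rows[i].a, rows[i].c);

		start = end;
	}
}

int main(void)
{
	Row		rows[] = {{1, 3}, {1, 1}, {2, 5}, {3, 2}, {3, 9}, {3, 4}, {4, 7}};

	incremental_sort(rows, (int) (sizeof(rows) / sizeof(rows[0])));
	return 0;
}

The same shape would apply in the executor: a group only needs to go
through the tuplesort once it is known to contain more than one tuple.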

- Heikki


#7David Steele
david@pgmasters.net
In reply to: Heikki Linnakangas (#6)
Re: [PATCH] Incremental sort

Hi Alexander,

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please find the rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

--
-David
david@pgmasters.net


#8Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: David Steele (#7)
Re: [PATCH] Incremental sort

On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:

Hi Alexander,

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please find the rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To

alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thank you for the reminder!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#9Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Heikki Linnakangas (#6)
1 attachment(s)
Re: [PATCH] Incremental sort

On Mon, Mar 20, 2017 at 5:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please find the rebased patch in the attachment.

I had a quick look at this.

* I'd love to have an explanation of what an Incremental Sort is, in the
file header comment for nodeIncrementalSort.c.

Done.

* I didn't understand the maxMem stuff in tuplesort.c. The comments there

use the phrase "on-disk memory", which seems like an oxymoron. Also,
"maximum status" seems weird, as it assumes that there's a natural order to
the states.

Variables were renamed.

* In the below example, the incremental sort is significantly slower than

the Seq Scan + Sort you get otherwise:

create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';

postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=138655.68..138655.69 rows=1 width=8)
   ->  Incremental Sort  (cost=610.99..124870.38 rows=1102824 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)

Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 387.091 ms

postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 231.668 ms

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when the
group contains exactly one tuple, and not put the tuple into the tuplesort in
that case.

I'm not sure we should add such an optimization just for groups of exactly
one tuple, since the situation is much the same with groups of 2 or 3 tuples.

Or if we cannot ensure that the Incremental Sort is actually faster, the
cost model should probably be smarter, to avoid picking an incremental sort
when it's not a win.

I added extra costing for incremental sort to cost_sort(): the cost of the
extra tuple copying and comparisons, as well as the cost of the tuplesort
reset.  The only problem is that I made the following estimate for the
tuplesort reset:

run_cost += 10.0 * cpu_tuple_cost * num_groups;

It makes the ordinary sort be selected in your example, but it contains the
constant 10, which is quite arbitrary.  It would be nice to avoid such
hard-coded constants, but I don't know how we could calculate this cost
realistically.
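
For concreteness: in your example every group contains a single tuple, so
num_groups is about 1000000.  With the default cpu_tuple_cost = 0.01 the
reset term alone adds

    10.0 * 0.01 * 1000000 = 100000

to run_cost, and the per-tuple copy-and-compare term adds at least another
0.01 * 1000000 = 10000, which together is enough to make the plain sort win
the cost comparison.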

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-4.patchapplication/octet-stream; name=incremental-sort-4.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index a466bf2..1cabe3f
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1913,1951 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2487,2507 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 8f3edc1..a13d556
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index ac339fb..59763ab
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index ea19ba6..08222bc
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 79,84 ****
--- 79,86 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 92,98 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 94,100 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 100,105 ****
--- 102,109 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 993,998 ****
--- 997,1005 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1565,1570 ****
--- 1572,1583 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1890,1904 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1903,1940 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1908,1914 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1944,1950 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1932,1938 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1968,1974 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2001,2007 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2037,2043 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2058,2064 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2094,2100 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2071,2083 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2107,2120 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2117,2125 ****
--- 2154,2166 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2277,2282 ****
--- 2318,2360 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort groups: %ld",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index d1c1324..5332e83
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
         nodeTableFuncscan.o
--- 24,32 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o \
!        nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
         nodeTableFuncscan.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 5d59f95..e04175a
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 243,248 ****
--- 244,253 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 514,521 ****
--- 519,530 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 80c77ad..1fa1de4
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 302,307 ****
--- 303,313 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 521,526 ****
--- 527,536 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 789,794 ****
--- 799,808 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index ef35da6..afb5cb2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 733,739 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 734,740 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...5aa2c62
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required keys
+  *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+  *		input is already sorted by (key1, key2 ... keyM), M < N, we sort
+  *		individually the groups where the values of (key1, key2 ... keyM)
+  *		are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y), already presorted by x, while we need to sort
+  *		them by both x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and concatenating them, we get the
+  *		following tuple set, which is sorted by both x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  But
+  *		the biggest benefit of incremental sort is for queries with LIMIT,
+  *		because incremental sort can return the first tuples without reading
+  *		the whole input dataset.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Look up the underlying equality function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples whose prefix sort columns are equal and
+  *		sorts them using tuplesort.  This approach avoids sorting the whole
+  *		dataset.  Besides taking less memory and being faster, it lets us
+  *		start returning tuples before fetching the full dataset from the
+  *		outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	int					skipCols;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	skipCols = plannode->skipCols;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * Read the next group of tuples from the outer plan and pass them to
+ 	 * tuplesort.c.  Subsequent calls will fetch tuples from the tuplesort
+ 	 * until the current group is exhausted.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize support structures for cmpSortSkipCols - already
+ 		 * sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Only pass on the remaining columns that are unsorted.  Skip
+ 		 * abbreviated keys for incremental sort: we are unlikely to have
+ 		 * huge groups, so using abbreviated keys would likely be a waste
+ 		 * of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols - skipCols,
+ 									&(plannode->sort.sortColIdx[skipCols]),
+ 									&(plannode->sort.sortOperators[skipCols]),
+ 									&(plannode->sort.collations[skipCols]),
+ 									&(plannode->sort.nullsFirst[skipCols]),
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/*
+ 	 * Put the next group of tuples, i.e. those whose skipCols sort values
+ 	 * are all equal, into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		/* Put next group of presorted data to the tuplesort */
+ 		if (TupIsNull(node->prevSlot))
+ 		{
+ 			/* First tuple */
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->prevSlot, slot);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 			nTuples++;
+ 
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				bool	cmp;
+ 				cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+ 
+ 				/* Replace previous tuple with current one */
+ 				ExecCopySlot(node->prevSlot, slot);
+ 
+ 				/*
+ 				 * When skipCols are not equal then group of presorted data
+ 				 * is finished
+ 				 */
+ 				if (!cmp)
+ 					break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in the tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->prevSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * We forget the previous sort results and re-read the subplan; since
+ 	 * incremental sort keeps only the current group in the tuplesortstate,
+ 	 * there is no complete sorted output to rewind to, so we always re-sort.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index c23d5c5..be3748d
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 889,894 ****
--- 889,912 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 899,911 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 917,945 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObject(const void *from)
*** 4733,4738 ****
--- 4767,4775 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index bbb63a4..7dfa56f
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 826,837 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 826,835 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 854,859 ****
--- 852,875 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3656,3661 ****
--- 3672,3680 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 474f221..40b712e
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2025,2036 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2025,2037 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2039,2044 ****
--- 2040,2071 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2591,2596 ****
--- 2618,2625 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index a1e1a87..aca363d
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3234,3239 ****
--- 3234,3243 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 92de2b7..50f4502
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1563,1568 ****
--- 1564,1576 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when we already have data presorted by some of the required pathkeys.  In
+  * the second case we estimate the number of groups the source data is
+  * divided into by the presorted pathkeys, and then estimate the cost of
+  * sorting each individual group, assuming the data is divided among groups
+  * uniformly.  Also, if LIMIT is specified then we have to pull from the
+  * source and sort only some of the total groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1589,1595 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1597,1604 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1605,1623 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1614,1641 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
! 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1643,1655 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1661,1710 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group where the presorted
! 	 * keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1659,1665 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1714,1720 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1670,1679 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1725,1734 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1694 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1736,1768 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, assume the logarithmic part
! 		 * of the estimate to be 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can return any output.
+ 	 * Sorting the rest of the groups is required to return all the other
+ 	 * tuples.  Under a bounded sort, only about num_groups *
+ 	 * (output_tuples / tuples) groups actually need to be sorted.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
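+ 
+ 	/*
+ 	 * Illustrative example of the split above (hypothetical numbers): with
+ 	 * num_groups = 100 and output_tuples / tuples = 0.1, about 100 * 0.1 =
+ 	 * 10 groups have to be sorted; the first is charged to startup_cost and
+ 	 * the remaining 9 to run_cost.
+ 	 */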
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1699,1704 ****
--- 1773,1791 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to account for the cost
+ 		 * of detecting sort groups.  This turns into an extra copy and
+ 		 * comparison for each tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 10.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2452,2457 ****
--- 2539,2546 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2478,2483 ****
--- 2567,2574 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
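+  *    For example, given keys (a, b, c) and (a, b, d), the longest common
+  *    prefix is (a, b), so the result is 2.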
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no path is
!  *	  found.  If the pathkeys are satisfied only partially, an incremental
!  *	  sort would be needed to satisfy them completely.  Since incremental
!  *	  sort consumes its input by presorted groups, we would have to consume
!  *	  more data than in the case of a fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining keys can be satisfied by incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering, because the
! 		 * remaining keys can be handled by incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index aafec58..9535622
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 232,238 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 232,238 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 247,256 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 247,258 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 431,436 ****
--- 433,439 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1087,1092 ****
--- 1090,1096 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1121,1129 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1125,1135 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1471,1476 ****
--- 1477,1483 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1508 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1504,1519 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1610,1615 ****
--- 1621,1627 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1619,1625 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1631,1641 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1863,1869 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1879,1886 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3755,3762 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3772,3785 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3767,3774 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3790,3803 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4820,4826 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4849,4856 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5380,5392 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5410,5440 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
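+ 		/*
+ 		 * Some leading sort columns are already ordered in the input;
+ 		 * build an IncrementalSort node, which sorts the input in groups
+ 		 * sharing equal values of the first skipCols columns.
+ 		 */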
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5718,5724 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5766,5772 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5738,5744 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5786,5792 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5781,5787 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5829,5835 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5802,5808 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5850,5857 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5835,5841 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5884,5890 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6484,6489 ****
--- 6533,6539 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index c3fbf3c..5fe1235
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index fa7a5f8..33fd370
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3751,3764 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3751,3764 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3831,3844 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3831,3844 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4905,4917 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4905,4917 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6040,6047 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6040,6048 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 5930747..01d1328
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 623,628 ****
--- 623,629 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 6fa6540..2b7f081
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2698,2703 ****
--- 2698,2704 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index d88738e..9ae0c88
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 999ebce..9769a5c
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1555,1562 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost			 input_startup_cost = 0;
  	Cost			 input_total_cost = 0;
+ 	int				 n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2489,2497 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2498,2528 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
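! 		/* Input is presorted by a prefix of pathkeys: use incremental sort */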
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
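! 	/* If the input were already fully sorted, no sort path would be needed */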
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2505,2511 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2536,2544 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2813,2819 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2846,2853 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 5c382a2..6426e44
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3567,3572 ****
--- 3567,3608 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by the pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * the first i pathkeys divide the dataset into.  This is actually a
+  * convenience wrapper over estimate_num_groups().
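+  *
+  * For example, for pathkeys (a, b), result[0] is the estimated number of
+  * distinct values of a, and result[1] the estimated number of distinct
+  * (a, b) combinations.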
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index e9d561b..1e8572d
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..ed189c2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,293 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among
+ 								   sorts of groups, either in memory or on
+ 								   disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+ 								   space, false when it's a value for
+ 								   in-memory space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;	/* memory context for data surviving
+ 								   tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 640,648 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 677,706 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_SIZES);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 717,723 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 735,741 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 776,789 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 825,831 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 856,862 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 947,953 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1020,1026 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1057,1063 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1168,1179 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing a tuplesort's resources.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1232,1329 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This allows us to avoid recreating the tuplesort (and
!  *	so save resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
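! 	/* The memtuples array is kept across resets, so re-account its space */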
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3329,3343 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 11a6850..06184f4
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1655,1660 ****
--- 1655,1674 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset could already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key; skip keys are
+  *	 compared between each incoming tuple and the previous one to detect
+  *	 sort group boundaries.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1671,1676 ****
--- 1685,1710 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* have we finished fetching tuples
+ 								   from the outer node? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* the keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index b9369ac..e550f26
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 72,77 ****
--- 72,78 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 123,128 ****
--- 124,130 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 237,242 ****
--- 239,245 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 6e531b6..4959f95
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 720,725 ****
--- 720,736 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 8930edf..4bf6f3a
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1371,1376 ****
--- 1371,1386 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index d9a9b12..06827e3
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_ctescan(Path *path, Pla
*** 100,107 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 101,109 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#10Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#8)
Re: [PATCH] Incremental sort

On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:

On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:

Hi Alexander,

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please, find rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple to the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.
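
A minimal sketch of that special case, reusing the patch's prevSlot and
cmpSortSkipCols() (the helper itself is hypothetical, not part of the
patch):

    /*
     * Sketch only: true when the tuple in prevSlot is the sole member of
     * its group -- nothing has been fed to the tuplesort for this group
     * yet (nTuples == 0) and the incoming tuple already starts a new
     * group.  Such a tuple is trivially sorted and could be emitted
     * directly, avoiding ExecCopySlot() and the tuplesort round-trip.
     */
    static bool
    group_is_singleton(IncrementalSortState *node, TupleTableSlot *next,
                       int64 nTuples)
    {
        return nTuples == 0 && !TupIsNull(next) &&
               !cmpSortSkipCols(node, node->prevSlot, next);
    }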

This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thank you for reminder!

I've just done so. Please resubmit once updated, it's a cool feature.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#10)
1 attachment(s)
Re: [PATCH] Incremental sort

On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:

On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please, find rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple to the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week. Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thank you for reminder!

I've just done so. Please resubmit once updated, it's a cool feature.

Thank you!
I already sent a new version of the patch after David's reminder.
Please find the rebased patch in the attachment.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-5.patchapplication/octet-stream; name=incremental-sort-5.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 1a9e6c8..c27b63e
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1913,1951 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1913,1951 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2487,2504 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2487,2507 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index cf70ca2..94e0b3d
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 479,486 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 479,486 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index ac339fb..59763ab
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index a18ab43..1eb3f0d
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1576,1581 ****
--- 1583,1594 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1901,1915 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1914,1951 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for a IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1919,1925 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1955,1961 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1943,1949 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1979,1985 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2012,2018 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2048,2054 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2069,2075 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2105,2111 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2082,2094 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2118,2131 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2128,2136 ****
--- 2165,2177 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2288,2293 ****
--- 2329,2371 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
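+ 			/* holds only the current group, so cannot scan backwards */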
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index ef35da6..afb5cb2
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 733,739 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 734,740 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...5aa2c62
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort
+  *		used when the input is already presorted by a prefix of the
+  *		required keys list.  Thus, when it's required to sort by
+  *		(key1, key2 ... keyN) and the input is already sorted by
+  *		(key1, key2 ... keyM), M < N, we only sort groups where the
+  *		values of (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting
+  *		of two integers (x, y), already presorted by x, while it's
+  *		required to sort them by both x and y.  Let the input tuples be:
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and concatenating them, we would get
+  *		the following tuple set, which is sorted by both x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.
+  *		But it brings the biggest benefit to queries with LIMIT, because
+  *		incremental sort can return the first tuples without reading the
+  *		whole input dataset.
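+  *
+  *		As an illustration (hypothetical table, not from the patch):
+  *		given an index on (x), a query like
+  *
+  *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+  *
+  *		can read rows presorted by x from the index and sort each group
+  *		of equal x by y, returning the first 10 rows without reading the
+  *		rest of the input.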
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check whether the first "skipCols" sort column values of two tuples are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples whose prefix sort columns are equal
+  *		and sorts them using tuplesort.  This approach avoids sorting the
+  *		whole dataset at once.  Besides taking less memory and being
+  *		faster, it allows us to start returning tuples before fetching
+  *		the full dataset from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	int					skipCols;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	skipCols = plannode->skipCols;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * If we get here, the sorted output of the previous group (if any) is
+ 	 * exhausted: read the next group of tuples from the outer plan and pass
+ 	 * them to tuplesort.c.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize support structures for cmpSortSkipCols - already
+ 		 * sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Only pass on the remaining, unsorted columns.  Skip abbreviated
+ 		 * keys for incremental sort: we are unlikely to have huge groups,
+ 		 * so abbreviation would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols - skipCols,
+ 									&(plannode->sort.sortColIdx[skipCols]),
+ 									&(plannode->sort.sortOperators[skipCols]),
+ 									&(plannode->sort.collations[skipCols]),
+ 									&(plannode->sort.nullsFirst[skipCols]),
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/*
+ 	 * Feed the tuplesort with the next group of tuples, i.e. the tuples
+ 	 * whose skipCols sort values are all equal.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		/* Put next group of presorted data to the tuplesort */
+ 		if (TupIsNull(node->prevSlot))
+ 		{
+ 			/* First tuple */
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->prevSlot, slot);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 			nTuples++;
+ 
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				bool	cmp;
+ 				cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+ 
+ 				/* Replace previous tuple with current one */
+ 				ExecCopySlot(node->prevSlot, slot);
+ 
+ 				/*
+ 				 * When skipCols are not equal then group of presorted data
+ 				 * is finished
+ 				 */
+ 				if (!cmp)
+ 					break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done by the number of tuples we've actually sorted.
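+ 	 * For instance (illustrative numbers): with bound = 10, after a first
+ 	 * group of 4 tuples is sorted, bound_Done becomes 4 and the next group
+ 	 * is sorted with a remaining bound of 10 - 4 = 6 (see the
+ 	 * tuplesort_set_bound() call above).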
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in the tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->prevSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop the standalone tuple slot holding the previous outer tuple */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * We must forget the previous sort results, re-read the subplan and
+ 	 * re-sort: since only the current group is kept in the tuplesortstate,
+ 	 * rewinding and rescanning the sorted output is not an option.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 591a31a..cf228d6
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 61bc502..0d6f628
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 910,915 ****
--- 910,933 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 920,932 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 938,966 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4758,4763 ****
--- 4792,4800 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 766ca49..c371afc
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 836,847 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 836,845 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 864,869 ****
--- 862,885 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3677,3682 ****
--- 3693,3701 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 766f2d8..5e487d4
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2032,2043 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2032,2044 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2046,2051 ****
--- 2047,2078 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2598,2603 ****
--- 2625,2632 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 15))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 343b35a..2191634
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3279,3284 ****
--- 3279,3288 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index ed07e2f..eb17370
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1600,1605 ****
--- 1601,1613 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation, or an incremental
+  * sort when the input data is already presorted by some prefix of the
+  * required pathkeys.  In the latter case we estimate the number of groups
+  * the presorted pathkeys divide the source data into, and then estimate
+  * the cost of sorting each individual group, assuming the data is divided
+  * into groups uniformly.  Also, if a LIMIT is specified, we only have to
+  * pull from the source and sort some of the groups rather than all of them.
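+  *
+  * As a rough sketch of the arithmetic (uniform groups assumed, matching
+  * the computation below):
+  *
+  *		group_tuples = tuples / num_groups
+  *		group_cost  ~= comparison_cost * group_tuples * log2(group_tuples)
+  *		total CPU   ~= num_groups * group_cost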
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1626,1632 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1642,1660 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1651,1678 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1698,1747 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group of tuples that are
! 	 * equal on the presorted keys.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1751,1757 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, clamp the logarithmic factor
! 		 * of the estimate to 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can output anything.
+ 	 * Sorting the rest of the groups is required to return all the other
+ 	 * tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to account for the cost
+ 		 * of detecting sort group boundaries.  That amounts to an extra copy
+ 		 * and comparison for each input tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 10.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
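
For reference, the in-memory (quicksort) branch of the model above reduces to the following standalone arithmetic.  This is only an illustrative sketch: the function name and the flattened double parameters are hypothetical, and the real code charges the incremental-sort overheads only when presorted_keys > 0.

#include <math.h>

/*
 * Hypothetical restatement of the incremental sort cost model for the
 * case where every group fits in memory and is quicksorted.
 */
static double
incremental_sort_cost_sketch(double tuples, double num_groups,
							 double output_tuples, double comparison_cost,
							 double input_run_cost, double cpu_tuple_cost,
							 double cpu_operator_cost)
{
	double		group_tuples = tuples / num_groups;
	double		group_cost;
	double		startup_cost;
	double		run_cost = 0.0;
	double		rest_cost;

	/* per-group quicksort: N log2 N comparisons, logarithm clamped to 1 */
	if (group_tuples >= 2.0)
		group_cost = comparison_cost * group_tuples * log2(group_tuples);
	else
		group_cost = comparison_cost * group_tuples;

	/* each group's share of fetching tuples from the input */
	group_cost += input_run_cost / num_groups;

	/* the first group must be sorted before anything can be returned */
	startup_cost = group_cost;

	/* remaining groups, scaled down when a LIMIT lets us stop early */
	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
	if (rest_cost > 0.0)
		run_cost += rest_cost;

	/* per-tuple extraction, group detection, and per-group reset overhead */
	run_cost += cpu_operator_cost * tuples;
	run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
	run_cost += 10.0 * cpu_tuple_cost * num_groups;

	return startup_cost + run_cost;
}

With num_groups = 1 the rest_cost term vanishes and the formula collapses to the usual N log N estimate, which is the intended continuity property of the model.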
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
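
To make the semantics concrete: for a query with ORDER BY a, b, c whose input path is sorted by (a, b, d), the longest common prefix is (a, b), so an incremental sort only needs to sort by c within each group of equal (a, b).  A hypothetical call site (root and path are illustrative names):

	/* ORDER BY a, b, c over a path sorted by (a, b, d) */
	int		n_presorted = pathkeys_common(root->query_pathkeys,
										  path->pathkeys);

	Assert(n_presorted == 2);	/* (a, b) is the shared prefix */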
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are only partially satisfied, an incremental sort is
!  *	  needed to satisfy them completely.  Since an incremental sort consumes
!  *	  its input in presorted groups, it has to read more data than a fully
!  *	  presorted path would.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering, because the
! 		 * remaining keys can be handled by an incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, the pathkeys are useful only if
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 2a78595..bbe776f
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 236,242 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 236,242 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 251,260 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,262 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 438,444 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1099,1104 ****
--- 1102,1108 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1133,1141 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1137,1147 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1483,1488 ****
--- 1489,1495 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1509,1520 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1516,1531 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1622,1627 ****
--- 1633,1639 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1631,1637 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1643,1653 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1875,1881 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1891,1898 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3806,3813 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3823,3836 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3818,3825 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3841,3854 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4871,4877 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4900,4907 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5451,5463 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5481,5511 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use a regular Sort node when enable_incrementalsort is false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5789,5795 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5837,5843 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5809,5815 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5857,5863 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5852,5858 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5900,5906 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5873,5879 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5921,5928 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5906,5912 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5955,5961 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6555,6560 ****
--- 6604,6610 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index f99257b..09338c7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3751,3764 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3751,3764 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3831,3844 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3831,3844 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4905,4917 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4905,4917 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6040,6047 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6040,6048 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index cdb8e95..420d752
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 87cc44d..25fac59
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2702,2707 ****
--- 2702,2708 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e327e66..b2b8440
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 963,969 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 963,970 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 8536212..a99a1a7
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1555,1562 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost			 input_startup_cost = 0;
  	Cost			 input_total_cost = 0;
+ 	int				 n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2563,2571 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2873,2880 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index e462fbd..fb54f27
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 5c382a2..6426e44
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3567,3572 ****
--- 3567,3608 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element (zero-based) is the
+  * number of groups that the first i+1 pathkeys divide the dataset into.
+  * This is just a convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
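
A hypothetical caller (tuples, pathkeys, and presorted_keys are illustrative names) would read the returned array like this:

	double	   *num_groups = estimate_pathkeys_groups(pathkeys, root, tuples);

	/*
	 * num_groups[0] is the distinct count of the first key alone,
	 * num_groups[1] of the first two keys together, and so on.  The
	 * average sort group under the first presorted_keys keys is then:
	 */
	double		group_tuples = tuples / num_groups[presorted_keys - 1];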
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 8b5f064..780d3b7
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 859,864 ****
--- 859,873 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index e1e692d..ed189c2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 281,286 ****
--- 281,293 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied by any
+ 								   one of the sorted groups, either in-memory
+ 								   or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace refers to on-disk
+ 								   space, false when it refers to in-memory
+ 								   space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;	/* memory context surviving tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 633,638 ****
--- 640,648 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 667,685 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 677,706 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data that is worth keeping while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The contents
+ 	 * of this context are released by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_SIZES);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 696,702 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 717,723 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 714,719 ****
--- 735,741 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 754,766 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 776,789 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 802,808 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 825,831 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 833,839 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 856,862 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 924,930 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 947,953 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 997,1003 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1020,1026 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1034,1040 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1057,1063 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1145,1160 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1168,1179 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing a tuplesort's resources.  If 'delete' is
!  *	true, the whole tuplesort, including the Tuplesortstate itself, is
!  *	destroyed; otherwise only the per-sort contexts are reset, so that the
!  *	state can be reused for another sort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1213,1219 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1232,1329 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort: release all the data in it, but keep the
!  *	meta-information.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This avoids recreating the tuplesort (and thereby
!  *	saves resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
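
To show how the pieces fit together, here is a sketch of the kind of executor loop this reset API is designed for.  It is not the patch's actual nodeIncrementalSort.c code: the group-boundary test, the slot handling, and the outer_exhausted flag are elided or hypothetical, and skipAbbrev is passed as true because abbreviation setup would not pay off across many small groups:

	Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, nkeys, attNums,
											  sortOperators, sortCollations,
											  nullsFirst, work_mem,
											  false,	/* no random access */
											  true);	/* skip abbreviation */

	while (!outer_exhausted)
	{
		/* ... puttupleslot until the skip-key prefix changes ... */
		tuplesort_performsort(ts);
		/* ... gettupleslot until the current group is drained ... */
		tuplesort_reset(ts);	/* keep the metadata, drop the tuples */
	}
	tuplesort_end(ts);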
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3219,3245 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3329,3343 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index fa99244..0e59187
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1676,1681 ****
--- 1676,1695 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData holds the information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1692,1697 ****
--- 1706,1731 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* are we done fetching tuples from the
+ 								   outer node? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
+ } IncrementalSortState;
+ 
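
For illustration, a group-boundary check over these skip keys might look as follows.  This is a hypothetical sketch, not the patch's executor code: it assumes each fcinfo was initialized with InitFunctionCallInfoData against a btree comparison function, and it treats two NULLs in the same position as equal:

static bool
skip_keys_equal(IncrementalSortState *node, int skipCols,
				TupleTableSlot *prev, TupleTableSlot *cur)
{
	int			i;

	for (i = 0; i < skipCols; i++)
	{
		SkipKeyData *key = &node->skipKeys[i];
		bool		null1,
					null2;
		Datum		datum1 = slot_getattr(prev, key->attno, &null1);
		Datum		datum2 = slot_getattr(cur, key->attno, &null2);

		if (null1 != null2)
			return false;		/* NULL vs. non-NULL starts a new group */
		if (null1)
			continue;			/* both NULL: treat as equal */

		key->fcinfo.arg[0] = datum1;
		key->fcinfo.arg[1] = datum2;
		key->fcinfo.argnull[0] = false;
		key->fcinfo.argnull[1] = false;
		key->fcinfo.isnull = false;
		if (DatumGetInt32(FunctionCallInvoke(&key->fcinfo)) != 0)
			return false;		/* comparison function says "not equal" */
	}
	return true;
}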
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 177853b..cf64e29
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 239,244 ****
--- 241,247 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a2dd26f..05e4f82
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 730,735 ****
--- 730,746 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index ebf9480..dd0478d
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1372,1377 ****
--- 1372,1387 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6909359..86dcdbb
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 103,111 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5b3f475..616f9f5
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#12Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#11)
Re: [PATCH] Incremental sort

On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:

On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:

On 03/20/2017 11:33 AM, Alexander Korotkov wrote:

Please, find rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot().  To
alleviate that, it might be worthwhile to add a special case for when the
group contains exactly one tuple, and not put the tuple to the tuplesort
in that case.  Or, if we cannot ensure that the Incremental Sort is
actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week.  Please respond with a new
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thank you for the reminder!

I've just done so.  Please resubmit once updated, it's a cool feature.

Thank you!
I already sent a version of the patch after David's reminder.
Please find the rebased patch in the attachment.

Cool. I think that's still a bit late for v10?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#13Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#12)
Re: [PATCH] Incremental sort

On Mon, Apr 3, 2017 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:

On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

I already sent a version of the patch after David's reminder.
Please find the rebased patch in the attachment.

Cool. I think that's still a bit late for v10?

I don't know.  ISTM that I addressed all the issues raised by reviewers.
Also, this patch has been pending since late 2013.  It would be very nice to
finally get it in...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#13)
Re: [PATCH] Incremental sort

Hi,

On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:

I don't know.  ISTM that I addressed all the issues raised by reviewers.
Also, this patch has been pending since late 2013.  It would be very nice to
finally get it in...

To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.

Greetings,

Andres Freund


#15Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#14)
Re: [PATCH] Incremental sort

On Tue, Apr 4, 2017 at 12:09 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:

I don't know.  ISTM that I addressed all the issues raised by reviewers.
Also, this patch has been pending since late 2013.  It would be very nice to
finally get it in...

To me this hasn't gotten even remotely enough performance evaluation.

I'm ready to put my effort into that.

And I don't think it's fair to characterize it as pending since 2013,

Probably, this duration isn't a good characteristic at all.

given it was essentially "waiting on author" for most of that.

What makes you think so?  Do you have some statistics?  Or is it just a
random assumption?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#14)
Re: [PATCH] Incremental sort

On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:

To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.

This is undeniably a patch which has been kicking around for a long
time without getting a lot of attention, and if it just keeps getting
punted down the road, it's never going to become committable.
Alexander's questions upthread about what decisions the committer who
took an interest (Heikki) would prefer never really got an answer, for
example. I don't deny that there may be some work left to do here,
but I think blaming the author for a week's delay when this has been
ignored so often for so long is unfair.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#16)
Re: [PATCH] Incremental sort

On 2017-04-03 22:18:21 -0400, Robert Haas wrote:

On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:

To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.

This is undeniably a patch which has been kicking around for a lot of
time without getting a lot of attention, and if it just keeps getting
punted down the road, it's never going to become committable.

Indeed, it's old. And it hasn't gotten enough timely feedback.

But I don't think the wait time can meaningfully be measured by
subtracting two dates:
The first version of the patch, as a PoC, was posted 2013-12-14,
which then got a good amount of feedback & revisions, and then stalled
till 2014-07-12.  There a few back-and-forths yielded a new version.
From 2014-09-15 till 2015-10-16 the patch stalled, waiting on its
author.  That version had open todos ([1]), as had the version from
2016-03-13 ([2]), which weren't addressed by 2016-03-30 - unfortunately that
was pretty much when the tree was frozen.  2016-09-13 a rebased patch
was sent, some minor points were raised 2016-10-02 (unaddressed), a
larger review was done 2016-12-01 ([5]), unaddressed till 2017-02-18.
At that point we're in this thread.

There are obviously some long waiting-on-author periods in there, and
some long needs-review periods.

Alexander's questions upthread about what decisions the committer who
took an interest (Heikki) would prefer never really got an answer, for
example. I don't deny that there may be some work left to do here,
but I think blaming the author for a week's delay when this has been
ignored so often for so long is unfair.

I'm not trying to blame Alexander for a week's worth of delay, at all.
It's just that, well, we're past the original code-freeze date, three
days before the "final" code freeze. I don't think fairness is something
we can achieve at this point :(. Given the risk of regressions -
demonstrated in this thread, although partially addressed - and the very
limited amount of benchmarking done, it seems unlikely that this is
going to be merged.

Regards,

Andres

[1]: http://archives.postgresql.org/message-id/CAPpHfdvhwMsG69exCRUGK3ms-ng0PSPcucH5FU6tAaM-qL-1%2Bw%40mail.gmail.com
[2]: http://archives.postgresql.org/message-id/CAPpHfdvzjYGLTyA-8ib8UYnKLPrewd9Z%3DT4YJNCRWiHWHHweWw%40mail.gmail.com
[3]: http://archives.postgresql.org/message-id/CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
[4]: http://archives.postgresql.org/message-id/CAPpHfdvj1Tdi2WA64ZbBp5-yG-uzaRXzk3K7J7zt-cRX6YSd0A@mail.gmail.com
[5]: http://archives.postgresql.org/message-id/CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
[6]: http://archives.postgresql.org/message-id/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com


#18Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#9)
1 attachment(s)
Re: [PATCH] Incremental sort

On Wed, Mar 29, 2017 at 5:14 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

I added to cost_sort() extra costing for incremental sort: the cost of extra
tuple copying and comparing, as well as the cost of resetting the tuplesort.
The only problem is that I made the following estimate for the tuplesort reset:

run_cost += 10.0 * cpu_tuple_cost * num_groups;

It makes the ordinary sort be selected in your example, but it contains the
constant 10, which is quite arbitrary.  It would be nice to avoid such
hard-coded constants, but I don't know how we could calculate such a cost
realistically.
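
For context, a minimal sketch of where such extra terms could sit inside
cost_sort(); the num_groups value and the exact shape of the copy/compare
term are illustrative here, not the exact patch code:

    /* Sketch: extra incremental-sort terms inside cost_sort() */
    if (presorted_keys > 0)
    {
        /* copying each tuple into the previous-tuple slot and comparing
         * the presorted columns against it */
        run_cost += (cpu_tuple_cost + comparison_cost) * tuples;

        /* one tuplesort reset per group; 10.0 is the arbitrary constant
         * discussed above */
        run_cost += 10.0 * cpu_tuple_cost * num_groups;
    }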

That estimate appears to be wrong.  I intended to make cost_sort() prefer
plain sort over incremental sort for this dataset size.  But that appears to
be not always the right solution: quicksort is this fast only on presorted
data.  On my laptop I have the following numbers for the test case provided
by Heikki.

Presorted data – very fast.

# explain select count(*) from (select * from sorttest order by a, c) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=147154.34..147154.35 rows=1 width=8)
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 260,752 ms

Non-presorted data – not so fast.  It's actually slower than the incremental
sort was.

# explain select count(*) from (select * from sorttest order by a desc, c desc) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a DESC, sorttest.c DESC
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

# select count(*) from (select * from sorttest order by a desc, c desc) as t;
  count
---------
 1000000
(1 row)

Time: 416,207 ms

Thus, it would be nice to reflect the fact that our quicksort
implementation is very fast on presorted data.  Fortunately, we have
a corresponding statistic: STATISTIC_KIND_CORRELATION.  However, that
probably should be the subject of a separate patch.
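
Just to illustrate the direction (purely hypothetical, not part of the
attached patch), such a correlation-aware estimate might scale the
per-comparison cost down as the column's correlation, taken from
STATISTIC_KIND_CORRELATION, approaches 1:

    /* Hypothetical sketch: cheapen sorting of nearly-presorted input */
    double  corr = fabs(attr_correlation);          /* in [0.0, 1.0] */
    Cost    per_comparison = comparison_cost * (1.0 - 0.5 * corr);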

But I'd like to make incremental sort no slower than quicksort in the case of
presorted data.  A new idea about that comes to mind: since the cause of
incremental sort's slowness in this case is too-frequent resetting of the
tuplesort, what if we artificially put the data into larger groups?  The
attached revision of the patch implements this: it keeps accumulating tuples
into the tuplesort until we have MIN_GROUP_SIZE tuples.
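
Simplified from the attached patch, the per-group loop in
ExecIncrementalSort() now behaves roughly like this (node->prevSlot is
assumed to track the previously fetched tuple; the previous-tuple
bookkeeping and end-of-input handling are omitted):

    for (;;)
    {
        slot = ExecProcNode(outerNode);

        if (nTuples < MIN_GROUP_SIZE)
        {
            /* below the threshold: absorb the tuple regardless of its
             * presorted-column values */
            tuplesort_puttupleslot(tuplesortstate, slot);
            nTuples++;
        }
        else if (!cmpSortSkipCols(node, node->prevSlot, slot))
        {
            /* presorted columns changed after the threshold: the current
             * group is complete, so sort and start emitting it */
            break;
        }
        else
        {
            tuplesort_puttupleslot(tuplesortstate, slot);
            nTuples++;
        }
    }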

# explain select count(*) from (select * from sorttest order by a, c) as t;
                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=85412.43..85412.43 rows=1 width=8)
   ->  Incremental Sort  (cost=0.46..72912.43 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 251,227 ms

# explain select count(*) from (select * from sorttest order by a desc, c desc) as t;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=85412.43..85412.43 rows=1 width=8)
   ->  Incremental Sort  (cost=0.46..72912.43 rows=1000000 width=12)
         Sort Key: sorttest.a DESC, sorttest.c DESC
         Presorted Key: sorttest.a
         ->  Index Only Scan Backward using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select count(*) from (select * from sorttest order by a desc, c desc) as t;
  count
---------
 1000000
(1 row)

Time: 253,270 ms

Now incremental sort is not slower than quicksort, which seems
cool.
However, in the LIMIT case we will pay the price of fetching some extra
tuples from the outer node.  But that doesn't seem to hurt us too much.
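
The patch keeps bounded (LIMIT) sorts correct across groups by giving each
per-group tuplesort only the still-needed part of the overall bound, and
accounting for the sorted tuples afterwards; simplified from the attached
patch:

    /* before sorting a group */
    if (node->bounded)
        tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);

    /* after the group has been sorted */
    if (node->bounded)
        node->bound_Done = Min(node->bound, node->bound_Done + nTuples);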

# explain select * from sorttest order by a, c limit 10;
                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Limit  (cost=0.46..0.84 rows=10 width=12)
   ->  Incremental Sort  (cost=0.46..37500.78 rows=1000000 width=12)
         Sort Key: a, c
         Presorted Key: a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select * from sorttest order by a, c limit 10;
 a  | b  | c  
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
  4 |  4 |  4
  5 |  5 |  5
  6 |  6 |  6
  7 |  7 |  7
  8 |  8 |  8
  9 |  9 |  9
 10 | 10 | 10
(10 rows)

Time: 0,903 ms

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-6.patchapplication/octet-stream; name=incremental-sort-6.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index d1bc5b0..c9de7ea
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1943,1981 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1943,1981 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2517,2534 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2517,2537 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 509bb54..263a646
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 487,494 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 487,494 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index e02b0c8..ad6b7d3
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9359d0a..52987bb
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1593,1598 ****
--- 1600,1611 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1918,1932 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1931,1968 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1936,1942 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1972,1978 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1960,1966 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1996,2002 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2029,2035 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2065,2071 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2086,2092 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2122,2128 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2111 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2135,2148 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2145,2153 ****
--- 2182,2194 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2305,2310 ****
--- 2346,2388 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index c2b8618..551664c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 736,742 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 737,743 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...052943b
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,546 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required keys
+  *		list.  Thus, when it's required to sort by (key1, key2 ... keyN) and
+  *		the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+  *		individually the groups in which (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y) already presorted by x, while it's required to
+  *		sort them by x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups, which
+  *		have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and concatenating them, we would get the
+  *		following tuple set, which is sorted by x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  But
+  *		the biggest benefit of incremental sort is for queries with
+  *		LIMIT, because incremental sort can return the first tuples without
+  *		reading the whole input dataset.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the equality function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ #define MIN_GROUP_SIZE 32
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples where the prefix sort columns are equal
+  *		and sorts them using tuplesort.  This approach avoids sorting the
+  *		whole dataset.  Besides taking less memory and being faster, it
+  *		allows returning tuples before the full dataset has been fetched
+  *		from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	int					skipCols;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	skipCols = plannode->skipCols;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * The current sorted group (if any) is exhausted.  Read the next group
+ 	 * of tuples from the outer plan and pass them to tuplesort.c.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures for cmpSortSkipCols() - the
+ 		 * already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We pass the tuplesort groups
+ 		 * of at least MIN_GROUP_SIZE tuples, so these groups don't
+ 		 * necessarily have equal values of the presorted columns.  With
+ 		 * incremental sort we are unlikely to have huge groups, so the use
+ 		 * of abbreviated keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/*
+ 	 * Put the next group of tuples, whose skipCols sort values are equal,
+ 	 * into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		/* Accumulate tuples unconditionally until MIN_GROUP_SIZE is reached */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			if (!TupIsNull(node->prevSlot))
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 				ExecClearTuple(node->prevSlot);
+ 				nTuples++;
+ 			}
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
+ 		}
+ 		else if (TupIsNull(node->prevSlot))
+ 		{
+ 			/* Threshold reached, but no previous tuple stashed yet */
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->prevSlot, slot);
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			tuplesort_puttupleslot(tuplesortstate, node->prevSlot);
+ 			nTuples++;
+ 
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				bool	cmp;
+ 				cmp = cmpSortSkipCols(node, node->prevSlot, slot);
+ 
+ 				/* Replace previous tuple with current one */
+ 				ExecCopySlot(node->prevSlot, slot);
+ 
+ 				/*
+ 				 * If the skipCols values differ, the current group of
+ 				 * presorted data is finished.
+ 				 */
+ 				if (!cmp)
+ 					break;
+ 			}
+ 		}
+ 	}
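+ 
+ 	/*
+ 	 * Unless the input is exhausted, prevSlot now holds the first tuple of
+ 	 * the next presorted group.
+ 	 */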
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
+ 	 * bucket in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->prevSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->prevSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->prevSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * Incremental sort doesn't support random access, and we hold only the
+ 	 * current group in the tuplesortstate, so we always forget previous sort
+ 	 * results and have to re-read the subplan and re-sort.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 924b458..1809e5d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 00a0fed..f57b6db
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 913,918 ****
--- 913,936 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 923,935 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 941,969 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4781,4786 ****
--- 4815,4823 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 28cef85..59eea51
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 839,850 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 839,848 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 867,872 ****
--- 865,888 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3693,3698 ****
--- 3709,3717 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index a883220..ccd49ec
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2036,2047 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2036,2048 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2050,2055 ****
--- 2051,2082 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2602,2607 ****
--- 2629,2636 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index b93b4fc..74c047a
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3280,3285 ****
--- 3280,3289 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 52643d0..165d049
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1600,1605 ****
--- 1601,1613 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort may be either a full sort of the relation, or an incremental sort
+  * when we already have data presorted by some of the required pathkeys.  In
+  * the latter case we estimate the number of groups the presorted pathkeys
+  * divide the source data into, and then estimate the cost of sorting each
+  * individual group, assuming the data is divided into groups uniformly.
+  * Also, if a LIMIT is specified then we only have to pull from the source
+  * and sort some of the total groups.
+  *
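+  * For example, if the input is estimated to form 1000 equally-sized
+  * presorted groups, we cost the sort roughly as 1000 independent sorts of
+  * tuples/1000 tuples each, plus per-tuple overhead for detecting group
+  * boundaries.
+  *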
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1626,1632 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1642,1660 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1651,1678 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1698,1747 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are all equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1751,1757 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, assume the logarithmic part
! 		 * of the estimate is 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can produce any
+ 	 * output; sorting the remaining groups is required to return all the
+ 	 * other tuples.  With a bounded sort, only the fraction of groups needed
+ 	 * to produce output_tuples (num_groups * output_tuples / tuples) is
+ 	 * charged.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * With incremental sort we also have to pay the cost of detecting
+ 		 * sort group boundaries, which amounts to an extra copy and
+ 		 * comparison for each input tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
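+  *    For example, for keys1 = (a, b, c) and keys2 = (a, b, d) the result
+  *    is 2.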
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path
!  *	  is found.  If the pathkeys are satisfied only partially, an
!  *	  incremental sort would be needed to satisfy them completely.  Since
!  *	  incremental sort consumes its input in presorted groups, we would
!  *	  have to read more data than with a fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining ones can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering, because the
! 		 * remaining keys can be handled by an incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 95e6eb7..fbee577
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 237,243 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 237,243 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 252,261 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 252,263 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 437,442 ****
--- 439,445 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1112,1117 ****
--- 1115,1121 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1146,1154 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1150,1160 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1502 ****
--- 1503,1509 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1523,1534 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1530,1545 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1641,1646 ****
--- 1652,1658 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1650,1656 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1662,1672 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1894,1900 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1910,1917 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3830,3837 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3847,3860 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3842,3849 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3865,3878 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4901,4907 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4930,4937 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5490,5502 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5520,5550 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5829,5835 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5877,5883 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5849,5855 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5897,5903 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5892,5898 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5940,5946 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5913,5919 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5961,5968 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5946,5952 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5995,6001 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6596,6601 ****
--- 6645,6651 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 649a233..b1f85e6
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3752,3765 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3752,3765 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3832,3845 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3832,3845 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4906,4918 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4906,4918 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6041,6048 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6041,6049 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 1278371..2a894ae
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index c1be34d..88143d2
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2701,2706 ****
--- 2701,2707 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index a1be858..f3f885f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 973,979 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 973,980 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 2d5caae..eff7ac1
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1555,1562 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost			 input_startup_cost = 0;
  	Cost			 input_total_cost = 0;
+ 	int				 n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	/* The subpath must not already satisfy all the requested pathkeys */
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2563,2571 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2873,2880 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 8502fcf..0af631a
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a35b93b..885bf43
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3568,3573 ****
--- 3568,3609 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element (zero-based) is the
+  * number of groups the first i + 1 pathkeys divide the dataset into.  This
+  * is a convenience wrapper over estimate_num_groups().
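+  *
+  * For example, for pathkeys (a, b) the result is { estimated number of
+  * groups by (a), estimated number of groups by (a, b) }.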
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index a414fb2..761c093
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 861,866 ****
--- 861,875 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 5f62cd5..9822e27
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among the
+ 								   sorts of groups, either in-memory or
+ 								   on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+ 								   space, false when it's a value for
+ 								   in-memory space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;	/* memory context for data surviving
+ 								   tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 720,726 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 779,792 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 828,834 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 859,865 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 950,956 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1025,1031 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1067,1073 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1242,1339 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This avoids recreating the tuplesort (and saves
!  *	resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3235,3261 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3345,3359 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 4330a85..fd69c0f
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1680,1685 ****
--- 1680,1699 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1696,1701 ****
--- 1710,1735 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* is fetching tuples from the outer
+ 								   node finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *prevSlot;	/* slot for previous tuple from outer node */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index f59d719..3e76ce3
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index cba9155..cfebbc5
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 740,745 ****
--- 740,756 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 7a8e2fd..9f5cc6f
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1418,1423 ****
--- 1418,1433 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index ed70def..47c26c4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 103,111 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 14b9026..4ea68e7
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#19Peter Geoghegan
pg@bowt.ie
In reply to: Alexander Korotkov (#18)
Re: [PATCH] Incremental sort

On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

That appears to be wrong. I intended to make cost_sort prefer plain sort
over incremental sort for this dataset size. But that appears to not
always be the right solution. Quicksort is so fast only on presorted data.

As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out of order tuple
is towards the end of the presorted input.

I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.
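
To illustrate the failure mode (a minimal Python sketch with made-up
names, not PostgreSQL's actual qsort code):

# A "presorted" precheck that gives up on the first out-of-order element.
# If that element sits near the end of the input, almost a whole comparison
# pass is wasted before the real sort even starts.
def sort_with_precheck(items):
    for i in range(1, len(items)):
        if items[i] < items[i - 1]:
            # Not perfectly presorted: fall back to a full sort, throwing
            # away everything the precheck has learned so far.
            return sorted(items)
    return items  # perfectly presorted: the precheck pays off

# ~1M ordered elements plus one out-of-order element at the very end cost
# nearly a full extra pass over the data.
data = list(range(1_000_000)) + [0]
result = sort_with_precheck(data)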

--
Peter Geoghegan


#20Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Peter Geoghegan (#19)
Re: [PATCH] Incremental sort

On Wed, Apr 26, 2017 at 7:56 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

That appears to be wrong. I intended to make cost_sort prefer plain sort
over incremental sort for this dataset size. But that appears to not
always be the right solution. Quicksort is so fast only on presorted data.

As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out of order tuple
is towards the end of the presorted input.

I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.

OK, I get it. Our qsort is so fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#21Peter Geoghegan
pg@bowt.ie
In reply to: Alexander Korotkov (#20)
Re: [PATCH] Incremental sort

On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

OK, I get it. Our qsort is so fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.

The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.

The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.
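
For example, test data could be generated along these lines (an
illustrative Python sketch, not from any attached script):

import random

# Build an array that is ~99% presorted by shuffling a small suffix, so a
# bail-on-first-misorder precheck gets no credit for the sorted prefix.
def nearly_presorted(n, presorted_frac=0.99, seed=0):
    rng = random.Random(seed)
    data = list(range(n))
    tail = max(1, int(n * (1.0 - presorted_frac)))
    suffix = data[n - tail:]
    rng.shuffle(suffix)
    data[n - tail:] = suffix
    return data

data = nearly_presorted(1_000_000)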

--
Peter Geoghegan


#22Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Peter Geoghegan (#21)
Re: [PATCH] Incremental sort

On Wed, Apr 26, 2017 at 8:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

OK, I get it. Our qsort is so fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.

The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.

The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.

I'm still focusing on making incremental sort not slower than qsort with
the presorted optimization, regardless of whether this is a "high risk"
optimization or not...
However, adding more test cases is always good.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#23Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#18)
Re: [PATCH] Incremental sort

On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind. Since the cause of
incremental sort's slowness in this case is too frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.

Now incremental sort is not slower than quicksort, and this seems to be
cool. However, in the LIMIT case we will pay the price of fetching some
extra tuples from the outer node. But that doesn't seem to hurt us too
much.

Any thoughts?

Nice idea.
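
In sketch form, the batching idea looks roughly like this (illustrative
Python; MIN_GROUP_SIZE's value and all names are placeholders, not the
patch's actual code):

MIN_GROUP_SIZE = 32  # placeholder value

# Keep accumulating input rows into the current batch until at least
# MIN_GROUP_SIZE rows are collected, and only finish the batch at the next
# presorted-prefix (skip key) boundary; then sort and emit the whole batch.
def incremental_sort(rows, skip_key, full_key):
    batch = []
    prev_skip = None
    for row in rows:
        cur_skip = skip_key(row)
        if batch and cur_skip != prev_skip and len(batch) >= MIN_GROUP_SIZE:
            # Group boundary reached and the batch is large enough: sorting
            # by the full key list (skip keys first) is safe because every
            # future row has a greater skip key than everything batched.
            yield from sorted(batch, key=full_key)
            batch = []  # corresponds to tuplesort_reset in the patch
        batch.append(row)
        prev_skip = cur_skip
    if batch:
        yield from sorted(batch, key=full_key)

# Usage: rows presorted by skip_key; full_key sorts by skip keys first.
out = list(incremental_sort(rows=[(1, 9), (1, 3), (2, 5), (2, 1)],
                            skip_key=lambda r: r[0],
                            full_key=lambda r: r))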

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#24Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Robert Haas (#23)
Re: [PATCH] Incremental sort

On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind. Since the cause of
incremental sort's slowness in this case is too frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.

Now incremental sort is not slower than quicksort, and this seems to be
cool. However, in the LIMIT case we will pay the price of fetching some
extra tuples from the outer node. But that doesn't seem to hurt us too
much.

Any thoughts?

Nice idea.

Cool.
Then I'm going to make a set of synthetic performance tests in order to
ensure that there is no regression.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#25Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#24)
3 attachment(s)
Re: [PATCH] Incremental sort

On Thu, Apr 27, 2017 at 5:23 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

But I'd like to make incremental sort not slower than quicksort in the case
of presorted data. A new idea came to my mind. Since the cause of
incremental sort's slowness in this case is too frequent resets of the
tuplesort, what if we artificially put the data into larger groups? The
attached revision of the patch implements this: it doesn't stop
accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.

Now incremental sort is not slower than quicksort, and this seems to be
cool. However, in the LIMIT case we will pay the price of fetching some
extra tuples from the outer node. But that doesn't seem to hurt us too
much.

Any thoughts?

Nice idea.

Cool.
Then I'm going to make a set of synthetic performance tests in order to
ensure that there is no regression.

The next revision of the patch is attached.
This revision contains one important optimization. I found that it's not
necessary to make every tuple go through the prevTuple slot. It's enough to
save a single sample tuple per sort group and compare skip columns against
it. This optimization avoids the regression on large sort groups which I
had observed.
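
In sketch form (illustrative Python rather than the patch's C code), the
change amounts to:

# Detect group boundaries by comparing each incoming row's skip columns
# against a single saved sample row per group.
def split_groups(rows, skip_key):
    group, sample = [], None
    for row in rows:
        if sample is not None and skip_key(row) != skip_key(sample):
            yield group            # skip columns changed: group ended
            group, sample = [], None
        if sample is None:
            sample = row           # the one saved tuple for this group
        group.append(row)
    if group:
        yield group

# Rows presorted by the first column (the skip column):
rows = [(1, 'b'), (1, 'a'), (2, 'c'), (3, 'a'), (3, 'b')]
groups = list(split_groups(rows, skip_key=lambda r: r[0]))
# -> [[(1, 'b'), (1, 'a')], [(2, 'c')], [(3, 'a'), (3, 'b')]]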

I'm also attaching the python script (incsort_test.py) which I use for
synthetic performance benchmarking. This script runs benchmarks similar to
the one posted by Heikki, but with some variations. These benchmarks are
aimed at checking whether there are cases where incremental sort is slower
than plain sort.

This script generates tables with the structure described in the 'tables'
array. For generation of text values, the md5 function is used. For the
first GroupedCols table columns, groups of GroupSize equal values are
generated. Then there are columns whose values are simply sequential. The
last column has a PreorderedFrac fraction of sequential values, and the
rest of its values are random. Therefore, we can measure the influence of
the presorted optimization in qsort with various fractions of presorted
data. There is also a btree index covering all the columns of the table.
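
A hypothetical fragment in the spirit of that script (the real generator
is in the attached incsort_test.py; the names and details below are
guesses):

# Emit SQL that builds a table with GroupedCols grouped text columns
# (groups of GroupSize equal md5 values), a sequential column, a last
# column that is sequential for roughly PreorderedFrac of the rows and
# random otherwise, plus a btree index over all columns.
def make_table_sql(name, nrows, grouped_cols, group_size, preordered_frac):
    cols = [f"c{i} text" for i in range(grouped_cols)] + ["s int", "last int"]
    yield f"CREATE TABLE {name} ({', '.join(cols)});"
    grouped = ", ".join(f"md5((g.i / {group_size})::text)"
                        for _ in range(grouped_cols))
    yield (f"INSERT INTO {name} SELECT {grouped}, g.i, "
           f"CASE WHEN random() < {preordered_frac} THEN g.i "
           f"ELSE (random() * 1000000000)::int END "
           f"FROM generate_series(1, {nrows}) g(i);")
    all_cols = ", ".join([f"c{i}" for i in range(grouped_cols)] + ["s", "last"])
    yield f"CREATE INDEX ON {name} ({all_cols});"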

The benchmark query selects the contents of the generated table ordered by
the grouped columns and by the last column. An index-only scan outputs
tuples ordered by the grouped columns, and incremental sort has to perform
sorting inside those groups. The plain sort case is forced to also use
index-only scans, so that we compare sort methods rather than scan methods.
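
The two timed cases could look like this (again a hypothetical sketch; the
actual script may differ):

# Force index-only scans in both cases, so only the sort method differs;
# enable_incrementalsort then selects between the two sort strategies.
def benchmark_queries(name, grouped_cols):
    order_by = ", ".join([f"c{i}" for i in range(grouped_cols)] + ["last"])
    setup = ["SET enable_seqscan = off;", "SET enable_bitmapscan = off;"]
    query = f"SELECT * FROM {name} ORDER BY {order_by};"
    incremental = setup + ["SET enable_incrementalsort = on;", query]
    plain = setup + ["SET enable_incrementalsort = off;", query]
    return incremental, plain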

The results are also attached (results.csv). The last column contains the
difference between incremental and plain sort time in percent. A negative
value means that incremental sort is faster in this case.

Incremental sort is faster in the vast majority of cases. It appears to be
slower only when the whole dataset is one sort group. In this case
incremental sort is useless, and using it should be considered a misuse of
incremental sort. The slowdown is related to the fact that we have to do
extra comparisons anyway, unless we somehow push our comparison result into
qsort itself and save some cpu cycles (but that would be an unreasonable
break of encapsulation). Thus, in such cases the regression seems to be
inevitable anyway. I think we could avoid this regression during query
planning: if we see that there would be only a few groups, we should choose
plain sort instead of incremental sort.
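
A sketch of such a planning-time guard (illustrative only; the threshold
value and its exact placement in the planner are open questions):

MIN_GROUPS_FOR_INCREMENTAL_SORT = 4  # hypothetical threshold

# With very few estimated skip-key groups, incremental sort degenerates
# into a plain sort plus extra comparisons, so prefer plain sort.
def choose_sort(estimated_groups):
    if estimated_groups < MIN_GROUPS_FOR_INCREMENTAL_SORT:
        return "plain sort"
    return "incremental sort"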

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-7.patchapplication/octet-stream; name=incremental-sort-7.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index d1bc5b0..c9de7ea
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1943,1981 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1943,1981 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
*************** select c2/2, sum(c2) * (c2/2) from ft1 g
*** 2517,2534 ****
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                 QUERY PLAN                                                
! ----------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          ->  Foreign Scan
!                Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
!                Relations: Aggregate on (public.ft1)
!                Remote SQL: SELECT c2, sum("C 1"), sqrt("C 1") FROM "S 1"."T 1" GROUP BY c2, (sqrt("C 1"))
! (9 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
--- 2517,2537 ----
  -- Aggregates in subquery are pushed down.
  explain (verbose, costs off)
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
!                                                         QUERY PLAN                                                        
! --------------------------------------------------------------------------------------------------------------------------
   Aggregate
     Output: count(ft1.c2), sum(ft1.c2)
!    ->  Incremental Sort
           Output: ft1.c2, (sum(ft1.c1)), (sqrt((ft1.c1)::double precision))
           Sort Key: ft1.c2, (sum(ft1.c1))
!          Presorted Key: ft1.c2
!          ->  GroupAggregate
!                Output: ft1.c2, sum(ft1.c1), (sqrt((ft1.c1)::double precision))
!                Group Key: ft1.c2, sqrt((ft1.c1)::double precision)
!                ->  Foreign Scan on public.ft1
!                      Output: ft1.c2, sqrt((ft1.c1)::double precision), ft1.c1
!                      Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" ORDER BY c2 ASC NULLS LAST, sqrt("C 1") ASC NULLS LAST
! (12 rows)
  
  select count(x.a), sum(x.a) from (select c2 a, sum(c1) b from ft1 group by c2, sqrt(c1) order by 1, 2) x;
   count | sum  
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 509bb54..263a646
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 487,494 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 487,494 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 0b9e300..84a26d9
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3563,3568 ****
--- 3563,3582 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9359d0a..52987bb
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1003,1008 ****
--- 1007,1015 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1593,1598 ****
--- 1600,1611 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1918,1932 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1931,1968 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1936,1942 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1972,1978 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1960,1966 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1996,2002 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2029,2035 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2065,2071 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2086,2092 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2122,2128 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2111 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2135,2148 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2145,2153 ****
--- 2182,2194 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2305,2310 ****
--- 2346,2388 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 7e85c66..e7fd9f9
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 30,35 ****
--- 30,36 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 248,253 ****
--- 249,258 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 519,526 ****
--- 524,535 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
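+ 			/* we hold only the current sort group, so can't back up */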
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index 486ddf1..2f4a23a
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 93,98 ****
--- 93,99 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 308,313 ****
--- 309,319 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecProcNode(PlanState *node)
*** 531,536 ****
--- 537,546 ----
  			result = ExecSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			result = ExecIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			result = ExecGroup((GroupState *) node);
  			break;
*************** ExecEndNode(PlanState *node)
*** 803,808 ****
--- 813,822 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index c2b8618..551664c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 736,742 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 737,743 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...79ae888
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,527 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required keys
+  *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+  *		input is already sorted by (key1, key2 ... keyM), M < N, we can sort
+  *		each group of tuples where the values of (key1, key2 ... keyM) are
+  *		equal separately.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y), already presorted by x, while we need to sort
+  *		them by both x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and concatenating them, we get the
+  *		following tuple set, which is sorted by both x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  But
+  *		the biggest benefit of incremental sort comes from queries with
+  *		LIMIT, because incremental sort can return the first tuples without
+  *		reading the whole input dataset.
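+  *
+  *		For example (illustrative; "tbl" stands for any table with an index
+  *		on x), a query like
+  *
+  *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+  *
+  *		can use an index scan to produce tuples sorted by x and an
+  *		incremental sort on (x, y), sorting only as many groups as are
+  *		needed to produce the first 10 rows.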
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
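+ /*
+  * Minimum number of tuples to accumulate before sorting a group.  Tiny
+  * presorted groups are gathered together until at least this many tuples
+  * have been fetched, which amortizes the per-group overhead of resetting
+  * and running the tuplesort.
+  */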
+ #define MIN_GROUP_SIZE 32
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples where the prefix sort columns are equal
+  *		and sorts them using tuplesort.  This approach avoids sorting the
+  *		whole dataset.  Besides taking less memory and being faster, it
+  *		allows the node to start returning tuples before fetching the full
+  *		dataset from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecIncrementalSort(IncrementalSortState *node)
+ {
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return the next tuple from the sorted set, if any.  When the current
+ 	 * group is exhausted and the outer plan is finished too, return the
+ 	 * empty slot to signal end of data; otherwise fall through and sort the
+ 	 * next group.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * Read the next group of tuples from the outer plan and pass them to
+ 	 * tuplesort.c.  Unlike a plain Sort, we never read the whole input at
+ 	 * once; only the current group is held in the tuplesort.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures used by cmpSortSkipCols() to
+ 		 * compare the already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We pass groups of at least
+ 		 * MIN_GROUP_SIZE tuples to the tuplesort, so the tuples in a group
+ 		 * don't necessarily have equal values of the presorted columns.
+ 		 * We are unlikely to have huge groups with incremental sort, so
+ 		 * using abbreviated keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/* Put the saved tuple, if any, into the tuplesort */
+ 	if (!TupIsNull(node->sampleSlot))
+ 	{
+ 		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ 		ExecClearTuple(node->sampleSlot);
+ 		nTuples++;
+ 	}
+ 
+ 	/*
+ 	 * Put the next group of tuples, i.e. tuples whose skipCols sort values
+ 	 * are all equal, into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (TupIsNull(slot))
+ 		{
+ 			node->finished = true;
+ 			break;
+ 		}
+ 
+ 		/* Put next group of presorted data to the tuplesort */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 
+ 			/* Save last tuple in minimal group */
+ 			if (nTuples == MIN_GROUP_SIZE - 1)
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 			nTuples++;
+ 		}
+ 		else
+ 		{
+ 			/* Iterate while the skip cols are the same as in the saved tuple */
+ 			bool	cmp;
+ 			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+ 
+ 			if (cmp)
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, slot);
+ 				nTuples++;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 				break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->sampleSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot used for tuples from the outer node */
+ 	ExecDropSingleTupleTableSlot(node->sampleSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * Incremental sort holds only the current group in the tuplesort, so
+ 	 * there is no complete sorted output to rewind.  Forget the previous
+ 	 * sort results; we have to re-read the subplan and re-sort from
+ 	 * scratch.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 924b458..1809e5d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 89,95 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 89,96 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 35a237a..2c2e17d
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 915,920 ****
--- 915,938 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 925,937 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 943,971 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4784,4789 ****
--- 4818,4826 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 98f6768..6944701
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 841,852 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 841,850 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 869,874 ****
--- 867,890 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3697,3702 ****
--- 3713,3721 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index f9a227e..ce1db85
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2038,2049 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2038,2050 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2052,2057 ****
--- 2053,2084 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2604,2609 ****
--- 2631,2638 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index b93b4fc..74c047a
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3280,3285 ****
--- 3280,3289 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 52643d0..165d049
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1600,1605 ****
--- 1601,1613 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when we already have data presorted by some of the required pathkeys.  In
+  * the latter case we estimate the number of groups the source data is
+  * divided into by the presorted pathkeys, and then estimate the cost of
+  * sorting each individual group, assuming the data is divided into groups
+  * uniformly.  Also, if a LIMIT is specified, then we only have to pull from
+  * the source and sort some of the groups rather than all of them.
+  *
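+  * For example (illustrative numbers): sorting 1M tuples that fall into
+  * 1000 presorted groups costs roughly 1000 sorts of 1000 tuples each, and
+  * with a small LIMIT only the first group has to be fetched and sorted
+  * before rows can be returned.
+  *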
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1626,1632 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1642,1660 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1651,1678 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1698,1747 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group of tuples whose
! 	 * presorted keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1751,1757 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, assume the logarithmic part
! 		 * of the estimate to be 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can return any
+ 	 * output, so its cost is charged to startup.  Sorting the rest of the
+ 	 * groups is required to return all the other tuples.  With a LIMIT,
+ 	 * only an output_tuples / tuples fraction of the groups has to be
+ 	 * fetched and sorted, which is what the expression below accounts for.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to account for the cost
+ 		 * of detecting the sort group boundaries.  That amounts to an extra
+ 		 * copy and comparison for each tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of per group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2489,2494 ****
--- 2576,2583 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2515,2520 ****
--- 2604,2611 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 2c26906..2da6f40
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Return the length of the longest common prefix of keys1 and keys2.
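+  *    For example, pathkey lists (a, b, c) and (a, b, d) share a common
+  *    prefix of length 2.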
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are satisfied only partially, we would have to perform
!  *	  an incremental sort in order to satisfy them completely.  Since
!  *	  incremental sort consumes its input in presorted groups, we would have
!  *	  to consume more data than in the case of a fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of the pathkeys is useful for ordering, because
! 		 * we can use incremental sort to provide the rest.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 52daf43..3632215
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 237,243 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 237,243 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 252,261 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 252,263 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 437,442 ****
--- 439,445 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1112,1117 ****
--- 1115,1121 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1146,1154 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1150,1160 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1497,1502 ****
--- 1503,1509 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1523,1534 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1530,1545 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1641,1646 ****
--- 1652,1658 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1650,1656 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1662,1672 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1894,1900 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1910,1917 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3830,3837 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3847,3860 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3842,3849 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3865,3878 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4901,4907 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4930,4937 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5490,5502 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5520,5550 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5829,5835 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5877,5883 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5849,5855 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5897,5903 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5892,5898 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5940,5946 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5913,5919 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5961,5968 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5946,5952 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5995,6001 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6597,6602 ****
--- 6646,6652 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 5565736..eaf7a78
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index c4a5651..c1b8eb7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3755,3768 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3755,3768 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3835,3848 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3835,3848 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4909,4921 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4909,4921 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6044,6051 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6044,6052 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index c192dc4..92e9923
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index c1be34d..88143d2
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2701,2706 ****
--- 2701,2707 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_Gather:
  		case T_GatherMerge:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index a1be858..f3f885f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 973,979 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 973,980 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 2d5caae..eff7ac1
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1308 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1297,1309 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1316,1321 ****
--- 1317,1324 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1552,1558 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1555,1562 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1644,1649 ****
--- 1648,1654 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost			 input_startup_cost = 0;
  	Cost			 input_total_cost = 0;
+ 	int				 n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1660,1666 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1665,1673 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1674,1679 ****
--- 1681,1688 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2563,2571 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2873,2880 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 8502fcf..0af631a
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a35b93b..885bf43
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3568,3573 ****
--- 3568,3609 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+  * 							  is divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of
+  * groups that the first i pathkeys divide the dataset into.  This is
+  * effectively a convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 587fbce..d2b2596
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 861,866 ****
--- 861,875 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 5f62cd5..9822e27
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among
+ 								   sorts of groups, in-memory or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+ 								   space, false when it's a value for
+ 								   in-memory space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 720,726 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 779,792 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 828,834 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 859,865 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 950,956 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1025,1031 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1067,1073 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1242,1339 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
!  *	a new sort.  This avoids recreating the tuplesort (and saves resources)
!  *	when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3235,3261 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3345,3359 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...09c5a27
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,25 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node,
+ 													EState *estate, int eflags);
+ extern TupleTableSlot *ExecIncrementalSort(IncrementalSortState *node);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index f289f3c..0b6ff3d
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1692,1697 ****
--- 1692,1711 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1708,1713 ****
--- 1722,1747 ----
  	void	   *tuplesortstate; /* private state of tuplesort.c */
  } SortState;
  
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* is fetching tuples from the outer
+ 								   node finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index f59d719..3e76ce3
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 164105a..f845026
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 751,756 ****
--- 751,767 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index adbd3dd..96eebd3
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1419,1424 ****
--- 1419,1434 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index ed70def..47c26c4
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 103,111 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 25fe78c..01073dd
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 9f9d2dc..b8884b6
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 204,209 ****
--- 204,212 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 14b9026..4ea68e7
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 6163ed8..9553648
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index d43b75c..ec611f5
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
results.csvtext/csv; charset=US-ASCII; name=results.csvDownload
incsort_test.pytext/x-python-script; charset=US-ASCII; name=incsort_test.pyDownload
#26Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#25)
Re: [PATCH] Incremental sort

On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Incremental sort is faster in the vast majority of cases. It appears to be
slower only when the whole dataset is one sort group. In this case
incremental sort is useless, and it should be considered a misuse of
incremental sort. The slowdown is related to the fact that we have to do
extra comparisons anyway, unless we somehow push our comparison result into
qsort itself and save some cpu cycles (but that would be an unreasonable
break of encapsulation). Thus, in such cases regression seems to be
inevitable anyway. I think we could evade this regression during query
planning. If we see that there would be only a few groups, we should choose
plain sort instead of incremental sort.

I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great. Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully). It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable). For example, consider the case
where we need to sort 100m rows and there are 2 groups. If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort. On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big. On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough. On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.
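
To make those two extremes concrete, here is a purely illustrative sketch
(table, data, and row counts are hypothetical, scaled down from the
100m-row example above):

CREATE TABLE t (grp int, payload int);
CREATE INDEX ON t (grp);
-- extreme 1: essentially one huge group, so an incremental sort
-- degenerates into a plain sort plus useless group-boundary checks
INSERT INTO t SELECT 1, g FROM generate_series(1, 1000000) g;
-- extreme 2 (alternative fill): two equal halves, where two half-size
-- sorts may fit in work_mem while one full sort would spill to disk
-- INSERT INTO t SELECT g % 2, g FROM generate_series(1, 1000000) g;
ANALYZE t;
EXPLAIN SELECT * FROM t ORDER BY grp, payload;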

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Robert Haas (#26)
1 attachment(s)
Re: [PATCH] Incremental sort

On Mon, May 8, 2017 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Incremental sort is faster in the vast majority of cases. It appears to be
slower only when the whole dataset is one sort group. In this case
incremental sort is useless, and it should be considered a misuse of
incremental sort. The slowdown is related to the fact that we have to do
extra comparisons anyway, unless we somehow push our comparison result into
qsort itself and save some cpu cycles (but that would be an unreasonable
break of encapsulation). Thus, in such cases regression seems to be
inevitable anyway. I think we could evade this regression during query
planning. If we see that there would be only a few groups, we should choose
plain sort instead of incremental sort.

I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great.

Thank you for paying attention to this patch!

Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully).

Right, I agree.

It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable). For example, consider the case
where we need to sort 100m rows and there are 2 groups. If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort. On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.

More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big. On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough. On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.

I agree with that. I need to test this patch more carefully in the case
where groups have different sizes. It's likely I need to add yet another
parameter to my testing script: the skew of group sizes.

A patch rebased onto current master is attached. I'm going to improve my
testing script and post new results.
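
For instance, a skewed dataset for such a test could be generated along
these lines (an illustrative sketch only; the table name and proportions
are made up, not taken from the actual script):

-- one dominant group plus many single-row groups
CREATE TABLE skew_test (grp int, payload float8);
INSERT INTO skew_test
SELECT CASE WHEN g <= 900000 THEN 0 ELSE g END, random()
FROM generate_series(1, 1000000) g;
ANALYZE skew_test;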

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-8.patchapplication/octet-stream; name=incremental-sort-8.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index c19b331..38c7e11
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1981,2019 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 5f65d9d..5dc7a24
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 5f59a38..ac9c9f0
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3591,3596 ****
--- 3591,3610 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 4cee357..56aaa6f
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1016,1021 ****
--- 1020,1028 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1606,1611 ****
--- 1613,1624 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1931,1945 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1944,1981 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1949,1955 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1985,1991 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1973,1979 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 2009,2015 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2042,2048 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2078,2084 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2099,2105 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2135,2141 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2112,2124 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2148,2161 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2158,2166 ****
--- 2195,2207 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
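+ 
+ 	/*
+ 	 * Also show which prefix of the sort keys the input is already sorted
+ 	 * by.  For a sort by (a, b) whose input is presorted by (a), the text
+ 	 * format would show, illustratively:
+ 	 *		Sort Key: a, b
+ 	 *		Presorted Key: a
+ 	 */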
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2369,2374 ****
--- 2410,2504 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		TuplesortInstrumentation stats;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &stats);
+ 		sortMethod = tuplesort_method_name(stats.sortMethod);
+ 		spaceType = tuplesort_space_type_name(stats.spaceType);
+ 		spaceUsed = stats.spaceUsed;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ 
+ 	if (incrsortstate->shared_info != NULL)
+ 	{
+ 		int			n;
+ 		bool		opened_group = false;
+ 
+ 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ 		{
+ 			TuplesortInstrumentation *sinstrument;
+ 			const char *sortMethod;
+ 			const char *spaceType;
+ 			long		spaceUsed;
+ 			int64		groupsCount;
+ 
+ 			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ 			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ 			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ 				continue;		/* ignore any unfilled slots */
+ 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ 			spaceUsed = sinstrument->spaceUsed;
+ 
+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str,
+ 								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+ 								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+ 			}
+ 			else
+ 			{
+ 				if (!opened_group)
+ 				{
+ 					ExplainOpenGroup("Workers", "Workers", false, es);
+ 					opened_group = true;
+ 				}
+ 				ExplainOpenGroup("Worker", NULL, true, es);
+ 				ExplainPropertyInteger("Worker Number", n, es);
+ 				ExplainPropertyText("Sort Method", sortMethod, es);
+ 				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 				ExplainPropertyText("Sort Space Type", spaceType, es);
+ 				ExplainPropertyLong("Sort Groups", groupsCount, es);
+ 				ExplainCloseGroup("Worker", NULL, true, es);
+ 			}
+ 		}
+ 		if (opened_group)
+ 			ExplainCloseGroup("Workers", "Workers", false, es);
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
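+ 			/* keeps only the current sort group, so cannot scan backward */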
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 8737cc1..2c8fa93
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 28,33 ****
--- 28,34 ----
  #include "executor/nodeBitmapHeapscan.h"
  #include "executor/nodeCustom.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 258,263 ****
--- 259,268 ----
  			/* even when not parallel-aware */
  			ExecSortEstimate((SortState *) planstate, e->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 330,335 ****
--- 335,344 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 706,711 ****
--- 715,724 ----
  			/* even when not parallel-aware */
  			ExecSortReInitializeDSM((SortState *) planstate, pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 764,769 ****
--- 777,784 ----
  	 */
  	if (IsA(planstate, SortState))
  		ExecSortRetrieveInstrumentation((SortState *) planstate);
+ 	else if (IsA(planstate, IncrementalSortState))
+ 		ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
  
  	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
  								 instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 985,990 ****
--- 1000,1009 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeWorker((SortState *) planstate, toc);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate, toc);
+ 			break;
  
  		default:
  			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 0ae5873..dab5a1e
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 655,660 ****
--- 655,661 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 742,748 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 743,749 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04059cc
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,644 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required key
+  *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+  *		input is already sorted by (key1, key2 ... keyM), M < N, we sort
+  *		groups where the values of (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y), already presorted by x, while we need to sort
+  *		them by both x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and concatenating them, we get the
+  *		following tuple set, which is sorted by both x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  But
+  *		it benefits queries with LIMIT the most, because it can return the
+  *		first tuples without reading the whole input dataset.
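+  *
+  *		As an illustration (table and column names are hypothetical), given
+  *		an index on (x) alone, a query like
+  *
+  *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+  *
+  *		could be executed by incrementally sorting the index scan output:
+  *		only the first few groups of equal x must be fetched and sorted by
+  *		y before the LIMIT is satisfied.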
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
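+ /*
+  * Minimum number of tuples to accumulate in the tuplesort before checking
+  * whether the current group has ended.  Batching at least this many tuples
+  * amortizes per-group tuplesort reset overhead when groups are very small.
+  */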
+ #define MIN_GROUP_SIZE 32
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples with equal prefix sort columns and
+  *		sorts them using tuplesort.  This avoids sorting the whole dataset
+  *		at once.  Besides taking less memory and being faster, it allows
+  *		the node to start returning tuples before the full dataset has
+  *		been fetched from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * If this is the first time through, or the previous group is exhausted,
+ 	 * read the next group of tuples from the outer plan and pass them to
+ 	 * tuplesort.c.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures used by cmpSortSkipCols to
+ 		 * compare the already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We pass groups of at least
+ 		 * MIN_GROUP_SIZE tuples to the tuplesort, so such groups don't
+ 		 * necessarily share a value of the first column.  Groups are
+ 		 * unlikely to be huge with incremental sort, so using abbreviated
+ 		 * keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/* Put the saved tuple, if any, into the tuplesort */
+ 	if (!TupIsNull(node->sampleSlot))
+ 	{
+ 		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ 		ExecClearTuple(node->sampleSlot);
+ 		nTuples++;
+ 	}
+ 
+ 	/*
+ 	 * Put the next group of tuples, whose skipCols sort values are all
+ 	 * equal, into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (TupIsNull(slot))
+ 		{
+ 			node->finished = true;
+ 			break;
+ 		}
+ 
+ 		/* First, accumulate a minimal group of tuples unconditionally */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 
+ 			/* Save last tuple in minimal group */
+ 			if (nTuples == MIN_GROUP_SIZE - 1)
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 			nTuples++;
+ 		}
+ 		else
+ 		{
+ 			/* Iterate while the skip columns match the saved tuple */
+ 			bool	cmp;
+ 			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+ 
+ 			if (cmp)
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, slot);
+ 				nTuples++;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 				break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 	if (node->shared_info && node->am_worker)
+ 	{
+ 		TuplesortInstrumentation *si;
+ 
+ 		Assert(IsParallelWorker());
+ 		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ 		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ 		tuplesort_get_stats(tuplesortstate, si);
+ 		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ 															node->groupsCount;
+ 	}
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in the tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->sampleSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->sampleSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * Incremental sort keeps only the current group in the tuplesort, so
+ 	 * we cannot simply rewind and rescan the sorted output.  Forget the
+ 	 * previous sort results; we have to re-read the subplan and re-sort.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *						Parallel Query Support
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortEstimate
+  *
+  *		Estimate space required to propagate sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ 	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ 	shm_toc_estimate_chunk(&pcxt->estimator, size);
+ 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeDSM
+  *
+  *		Initialize DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+ 	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ 	/* ensure any unfilled slots will contain zeroes */
+ 	memset(node->shared_info, 0, size);
+ 	node->shared_info->num_workers = pcxt->nworkers;
+ 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ 				   node->shared_info);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortReInitializeDSM
+  *
+  *		Reset shared state before beginning a fresh scan.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	/* If there's any instrumentation space, clear it for next time */
+ 	if (node->shared_info != NULL)
+ 	{
+ 		memset(node->shared_info->sinfo, 0,
+ 			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ 	}
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeWorker
+  *
+  *		Attach worker to DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc)
+ {
+ 	node->shared_info =
+ 		shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id, true);
+ 	node->am_worker = true;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortRetrieveInstrumentation
+  *
+  *		Transfer sort statistics from DSM to private memory.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ 	Size		size;
+ 	SharedIncrementalSortInfo *si;
+ 
+ 	if (node->shared_info == NULL)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ 	si = palloc(size);
+ 	memcpy(si, node->shared_info, size);
+ 	node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 98bcaeb..2bddf63
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index f1bed14..0082db3
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 917,922 ****
--- 917,940 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 927,939 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 945,973 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4789,4794 ****
--- 4823,4831 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index b83d919..8619847
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 861,872 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 861,870 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 889,894 ****
--- 887,910 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3728,3733 ****
--- 3744,3752 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index fbf8330..5fdba3a
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2053,2064 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2053,2065 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2067,2072 ****
--- 2068,2099 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2624,2629 ****
--- 2651,2658 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 2d7e1d8..010fc2c
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3281,3286 ****
--- 3281,3290 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 051a854..f779ef9
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1600,1605 ****
--- 1601,1613 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when we already have data presorted by some of the required pathkeys.  In
+  * the second case we estimate the number of groups the source data is
+  * divided into by the presorted pathkeys, and then estimate the cost of
+  * sorting each individual group, assuming the data is divided into groups
+  * uniformly.  Also, if a LIMIT is specified, we only have to pull from the
+  * source and sort some of the groups.
+  *
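+  * For illustration (the numbers are hypothetical): with 1,000,000 input
+  * tuples uniformly divided into 1,000 groups by the presorted keys, each
+  * group is costed as a sort of about 1,000 tuples; the first group's sort
+  * is charged to startup cost, and under a LIMIT we expect to sort only
+  * about num_groups * (output_tuples / tuples) groups in total.
+  *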
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1626,1632 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1634,1641 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1642,1660 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1651,1678 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1680,1692 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1698,1747 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups into which the dataset is divided by
! 	 * the presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1696,1702 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1751,1757 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1707,1716 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1762,1771 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1718,1731 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1773,1805 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, assume the logarithmic part
! 		 * of the estimate to be 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can emit any output.
+ 	 * Sorting the remaining groups is required to return all the other
+ 	 * tuples; with a bound, we expect to sort only about
+ 	 * num_groups * (output_tuples / tuples) groups in total.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1736,1741 ****
--- 1810,1828 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to account for the cost
+ 		 * of detecting sort group boundaries.  This amounts to an extra copy
+ 		 * and comparison for each input tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of per group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2499,2504 ****
--- 2586,2593 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2525,2530 ****
--- 2614,2621 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9d83a5c..910f285
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
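+  *    For example, for keys (a, b, c) and (a, b, d) the result is 2.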
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies given parameterization and at least partially
!  *	  satisfies the given pathkeys.  Return NULL if no path found.
!  *	  If pathkeys are satisfied only partially then we would have to do
!  *	  incremental sort in order to satisfy pathkeys completely.  Since
!  *	  incremental sort consumes data by presorted groups, we would have to
!  *	  consume more data than in the case of fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.
!  * The remaining ones can be satisfied by an incremental sort.
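!  * For example, with query_pathkeys (a, b, c) and a path sorted by (a, b),
!  * the result is 2 when incremental sort is enabled, and 0 otherwise.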
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of the pathkeys is useful for ordering, because
! 		 * the remainder can be handled by incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 2821662..4c5d14f
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 250,259 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 250,261 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 435,440 ****
--- 437,443 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1110,1115 ****
--- 1113,1119 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1144,1152 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1148,1158 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1496,1501 ****
--- 1502,1508 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1525,1536 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1532,1547 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1643,1648 ****
--- 1654,1660 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1652,1658 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1664,1674 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1896,1902 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1912,1919 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3834,3841 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3851,3864 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3853 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3869,3882 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4899,4905 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4928,4935 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5484,5496 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5514,5544 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
  
+ 	/* Always use regular sort node when enable_incrementalsort = false */
+ 	if (!enable_incrementalsort)
+ 		skipCols = 0;
+ 
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5823,5829 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5871,5877 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5843,5849 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5891,5897 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5886,5892 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5934,5940 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5907,5913 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5955,5962 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5940,5946 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5989,5995 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6596,6601 ****
--- 6645,6651 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index bba8a1f..eca8561
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 6b79b3a..e239217
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3769,3782 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3769,3782 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3849,3862 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3849,3862 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4923,4935 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4923,4935 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6058,6065 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6058,6066 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index b0c9e94..65d44e7
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 634,639 ****
--- 634,640 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 1103984..8278316
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2765,2770 ****
--- 2765,2771 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index ccf2145..e6c5600
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 989,995 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 989,996 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 26567cb..ef03c21
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1296,1307 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1296,1308 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1315,1320 ****
--- 1316,1323 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1551,1557 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1554,1561 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1643,1648 ****
--- 1647,1653 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost		input_startup_cost = 0;
  	Cost		input_total_cost = 0;
+ 	int			n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1659,1665 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1664,1672 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1678 ****
--- 1680,1687 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2516,2524 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2525,2555 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2532,2538 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2563,2571 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2840,2846 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2873,2880 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 25905a3..6d165be
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 277,283 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 277,283 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index db1792b..3cb1ded
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3641,3646 ****
--- 3641,3682 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by the given pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * the first i pathkeys divide the dataset into.  Actually it's a convenience
+  * wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucket statistics when the specified expression is used
   * as a hash key for the given number of buckets.
   *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index bc9f09a..f7ab820
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 862,867 ****
--- 862,876 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 17e1b68..f331d88
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 282,287 ****
--- 282,294 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among sorts
+ 								   of groups, either in-memory or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+ 								   space, false when it's the value for
+ 								   in-memory space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 636,641 ****
--- 643,651 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 670,688 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 680,709 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 699,705 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 720,726 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 717,722 ****
--- 738,744 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 757,769 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 779,792 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 805,811 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 828,834 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 836,842 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 859,865 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 927,933 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 950,956 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 1002,1008 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1025,1031 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1044,1050 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1067,1073 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1155,1170 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1178,1189 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1223,1229 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1242,1339 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This allows us to avoid recreating the tuplesort (and
!  *	thus save resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3241,3258 ****
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	if (state->tapeset)
! 	{
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
- 		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3351,3365 ----
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
  	else
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...cfe944f
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 90a60ab..c21113a
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1730,1735 ****
--- 1730,1749 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset could already be
+  *	 presorted by some prefix of these keys.  We call these keys "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 Shared memory container for per-worker sort information
   * ----------------
*************** typedef struct SortState
*** 1758,1763 ****
--- 1772,1815 ----
  	SharedSortInfo *shared_info;	/* one entry per worker */
  } SortState;
  
+ /* ----------------
+  *	 Shared memory container for per-worker incremental sort information
+  * ----------------
+  */
+ typedef struct IncrementalSortInfo
+ {
+ 	TuplesortInstrumentation	sinstrument;
+ 	int64						groupsCount;
+ } IncrementalSortInfo;
+ 
+ typedef struct SharedIncrementalSortInfo
+ {
+ 	int							num_workers;
+ 	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+ 
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from the outer node
+ 								   is finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+ 	bool		am_worker;		/* are we a worker? */
+ 	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index 27bd4f3..ae772e8
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a382331..c592183
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index a39e59d..5a17189
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1419,1424 ****
--- 1419,1434 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 63feba0..04553d1
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 102,109 ****
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 103,111 ----
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 4e06b2e..4f2fe81
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 182,187 ****
--- 182,188 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 220,225 ****
--- 221,227 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern void estimate_hash_bucket_stats(PlannerInfo *root,
  						   Node *hashkey, double nbuckets,
  						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 90,97 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					TuplesortInstrumentation *stats);
  extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 1fa9650..1883170
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1493,1498 ****
--- 1493,1499 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1633,1641 ****
--- 1634,1678 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index 568b783..e60fb43
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select count(*) >= 0 as ok from pg_prepa
*** 70,90 ****
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_gathermerge   | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (12 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 70,91 ----
  -- This is to record the prevailing planner enable_foo settings during
  -- a regression test run.
  select name, setting from pg_settings where name like 'enable%';
!           name          | setting 
! ------------------------+---------
!  enable_bitmapscan      | on
!  enable_gathermerge     | on
!  enable_hashagg         | on
!  enable_hashjoin        | on
!  enable_incrementalsort | on
!  enable_indexonlyscan   | on
!  enable_indexscan       | on
!  enable_material        | on
!  enable_mergejoin       | on
!  enable_nestloop        | on
!  enable_seqscan         | on
!  enable_sort            | on
!  enable_tidscan         | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index c96580c..b389c63
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 527,532 ****
--- 527,533 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 588,596 ****
--- 589,614 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#28Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#27)
2 attachment(s)
Re: [PATCH] Incremental sort

On Thu, Sep 14, 2017 at 2:48 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.

A new benchmarking script and results are attached. A new dataset
parameter is introduced there: the skew factor. The skew factor defines
the skew in the distribution of group sizes.
My idea for generating it is simply to use a power function where the power
is between 0 and 1. The following formula is used to get the group number
for a particular item number i:
[((i / number_of_indexes) ^ power) * number_of_groups]
For example, power = 1/6 gives the following distribution of group sizes:
group number group size
0 2
1 63
2 665
3 3367
4 11529
5 31031
6 70993
7 144495
8 269297
9 468558

For convenience, instead of the power itself, I use a skew factor where
power = 1.0 / (1.0 + skew). Therefore, with skew = 0.0, the distribution of
group sizes is uniform. A larger skew gives a more skewed distribution (and
that seems quite intuitive). For negative skew, group sizes are mirrored
relative to the corresponding positive skew. For example, skew factor =
-5.0 gives the following distribution of group sizes:
group number group size
0 468558
1 269297
2 144495
3 70993
4 31031
5 11529
6 3367
7 665
8 63
9 2
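
For illustration, the positive-skew counterpart of this distribution
(power = 1/6, i.e. skew = 5.0) can be approximately reproduced with plain
SQL. This is just a sketch, and the exact boundary handling in the attached
incsort_test.py may differ by an item or two per group:

SELECT least(floor(power(i / 1000000.0, 1.0 / (1.0 + 5.0)) * 10), 9)
           AS group_number,
       count(*) AS group_size
FROM generate_series(1, 1000000) AS s(i)
GROUP BY 1
ORDER BY 1;

Mirroring the result, i.e. taking (9 - group_number), gives the skew =
-5.0 distribution shown above.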

The results show that among 2172 test cases, incremental sort gives a
speedup in 2113 and causes a slowdown in 59. The following 4 test cases
show the most significant slowdowns (>10% of time).

Table                   GroupedCols  GroupCount  Skew  PreorderedFrac  FullSortMedian  IncSortMedian  TimeChangePercent
int4|int4|numeric                 1         100   -10               0    1.5688240528   2.0607631207              31.36
text|int8|text|int4               1           1     0               0    1.7785198689   2.1816160679              22.66
int8|int8|int4                    1          10   -10               0     1.136412859   1.3166360855              15.86
numeric|text|int4|int8            2          10   -10               1    0.4403841496   0.5070910454              15.15

As you can see, 3 of these 4 test cases have a skewed distribution, while
the remaining one involves the costly locale-aware comparison of text. I
have no particular idea of how to cope with these slowdowns. Probably it's
OK to have a slowdown in some cases while having a speedup in the majority
of cases (assuming there is an option to turn off the new behavior; see the
example below). Probably we should teach the optimizer more about skewed
distributions of groups, but that doesn't seem feasible to me.
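
Just for illustration, that option already exists in the patch as the
enable_incrementalsort GUC; a minimal example, reusing the tenk1 query from
the regression test in the patch:

SET enable_incrementalsort = off;
EXPLAIN (COSTS OFF)
SELECT thousand, tenthous FROM tenk1
UNION ALL
SELECT thousand, thousand FROM tenk1
ORDER BY thousand, tenthous;
RESET enable_incrementalsort;

With the GUC off, make_sort() always builds a regular Sort node, so the
plan above should fall back to a full sort of the second append member.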

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incsort_test.pytext/x-python-script; charset=US-ASCII; name=incsort_test.pyDownload
results.csvtext/csv; charset=US-ASCII; name=results.csvDownload
#29Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#28)
Re: [PATCH] Incremental sort

BTW, replacement selection sort was removed by commit 8b304b8b. I think
it's worth rerunning the benchmarks after that, because the results might
change. Will do.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#30Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#29)
1 attachment(s)
Re: [PATCH] Incremental sort

On Sat, Sep 30, 2017 at 11:20 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

[...]

BTW, replacement selection sort was removed by commit 8b304b8b. I think
it's worth rerunning the benchmarks after that, because the results might
change. Will do.

I've applied the patch on top of c12d570f and rerun the same benchmarks.
A CSV file with the results is attached. There are no dramatic changes:
there is still a minority of cases with performance regression, while the
majority of cases show improvement.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

results2.csvtext/csv; charset=US-ASCII; name=results2.csvDownload
#31Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#30)
Re: [PATCH] Incremental sort

On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I've applied the patch on top of c12d570f and rerun the same benchmarks.
A CSV file with the results is attached. There are no dramatic changes:
there is still a minority of cases with performance regression, while the
majority of cases show improvement.

Yes, I think these results look pretty good. But are these times in
seconds? You might need to do some testing with bigger sorts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#32Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Robert Haas (#31)
Re: [PATCH] Incremental sort

On Tue, Oct 3, 2017 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I've applied the patch on top of c12d570f and rerun the same benchmarks.
A CSV file with the results is attached. There are no dramatic changes:
there is still a minority of cases with performance regression, while the
majority of cases show improvement.

Yes, I think these results look pretty good. But are these times in
seconds? You might need to do some testing with bigger sorts.

Good point. I'll rerun the benchmarks with a larger dataset size.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#33Antonin Houska
ah@cybertec.at
In reply to: Alexander Korotkov (#27)
Re: [HACKERS] [PATCH] Incremental sort

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Patch rebased to current master is attached. I'm going to improve my testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#34Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Antonin Houska (#33)
2 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Tue, Nov 14, 2017 at 7:00 PM, Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to apply.
Can you please rebase it again?

Sure, please find the rebased patch attached.
Also, I'd like to share partial results of the benchmarks with 100M rows.
It appears that 100M rows takes quite an amount of time. Perhaps in the
cases where there was degradation at 1M rows, it becomes somewhat
larger...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-9.patchapplication/octet-stream; name=incremental-sort-9.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 4339bbf..df72ab1
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1981,2019 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index ddfec79..c8c6fb7
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index d360fc4..1e878bf
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3552,3557 ****
--- 3552,3571 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 8f7062c..b46dc17
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1010,1015 ****
--- 1014,1022 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1600,1605 ****
--- 1607,1618 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1925,1939 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1938,1975 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1943,1949 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1979,1985 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1967,1973 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 2003,2009 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2036,2042 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2072,2078 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2093,2099 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2129,2135 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2106,2118 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2142,2155 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2152,2160 ****
--- 2189,2201 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2363,2368 ****
--- 2404,2498 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		TuplesortInstrumentation stats;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &stats);
+ 		sortMethod = tuplesort_method_name(stats.sortMethod);
+ 		spaceType = tuplesort_space_type_name(stats.spaceType);
+ 		spaceUsed = stats.spaceUsed;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ 
+ 	if (incrsortstate->shared_info != NULL)
+ 	{
+ 		int			n;
+ 		bool		opened_group = false;
+ 
+ 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ 		{
+ 			TuplesortInstrumentation *sinstrument;
+ 			const char *sortMethod;
+ 			const char *spaceType;
+ 			long		spaceUsed;
+ 			int64		groupsCount;
+ 
+ 			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ 			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ 			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ 				continue;		/* ignore any unfilled slots */
+ 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ 			spaceUsed = sinstrument->spaceUsed;
+ 
+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str,
+ 								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+ 								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+ 			}
+ 			else
+ 			{
+ 				if (!opened_group)
+ 				{
+ 					ExplainOpenGroup("Workers", "Workers", false, es);
+ 					opened_group = true;
+ 				}
+ 				ExplainOpenGroup("Worker", NULL, true, es);
+ 				ExplainPropertyInteger("Worker Number", n, es);
+ 				ExplainPropertyText("Sort Method", sortMethod, es);
+ 				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 				ExplainPropertyText("Sort Space Type", spaceType, es);
+ 				ExplainPropertyLong("Sort Groups", groupsCount, es);
+ 				ExplainCloseGroup("Worker", NULL, true, es);
+ 			}
+ 		}
+ 		if (opened_group)
+ 			ExplainCloseGroup("Workers", "Workers", false, es);
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index 083b20f..b093618
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index fd7e7cb..74c1da9
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 28,33 ****
--- 28,34 ----
  #include "executor/nodeBitmapHeapscan.h"
  #include "executor/nodeCustom.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 258,263 ****
--- 259,268 ----
  			/* even when not parallel-aware */
  			ExecSortEstimate((SortState *) planstate, e->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 330,335 ****
--- 335,344 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 703,708 ****
--- 712,721 ----
  			/* even when not parallel-aware */
  			ExecSortReInitializeDSM((SortState *) planstate, pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 761,766 ****
--- 774,781 ----
  	 */
  	if (IsA(planstate, SortState))
  		ExecSortRetrieveInstrumentation((SortState *) planstate);
+ 	else if (IsA(planstate, IncrementalSortState))
+ 		ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
  
  	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
  								 instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 982,987 ****
--- 997,1006 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeWorker((SortState *) planstate, toc);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate, toc);
+ 			break;
  
  		default:
  			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index d26ce08..3c37bda
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 754,760 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...04059cc
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,644 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required list
+  *		of keys.  Thus, when it's required to sort by (key1, key2 ... keyN)
+  *		and the input is already sorted by (key1, key2 ... keyM), M < N, we
+  *		sort the groups where values of (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y) already presorted by x, while it's required to
+  *		sort them by x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and putting them back together, we get
+  *		the following tuple set, which is sorted by x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  But
+  *		the biggest benefit of incremental sort is for queries with LIMIT,
+  *		because incremental sort can return the first tuples without reading
+  *		the whole input dataset.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ 
+ #define MIN_GROUP_SIZE 32
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples where the prefix sort columns are equal
+  *		and sorts them using tuplesort.  This approach avoids sorting the
+  *		whole dataset.  Besides taking less memory and being faster, it
+  *		allows us to start returning tuples before fetching the full
+  *		dataset from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * If first time through, read all tuples from outer plan and pass them to
+ 	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize support structures for cmpSortSkipCols - already
+ 		 * sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We pass groups of at least
+ 		 * MIN_GROUP_SIZE tuples to tuplesort.  Thus, these groups don't
+ 		 * necessarily have equal values of the first column.  We are
+ 		 * unlikely to have huge groups with incremental sort, so using
+ 		 * abbreviated keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 		node->groupsCount++;
+ 	}
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/* Put saved tuple to tuplesort if any */
+ 	if (!TupIsNull(node->sampleSlot))
+ 	{
+ 		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ 		ExecClearTuple(node->sampleSlot);
+ 		nTuples++;
+ 	}
+ 
+ 	/*
+ 	 * Put the next group of tuples, where the skipCols sort values are
+ 	 * equal, into the tuplesort.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (TupIsNull(slot))
+ 		{
+ 			node->finished = true;
+ 			break;
+ 		}
+ 
+ 		/* Put next group of presorted data to the tuplesort */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 
+ 			/* Save last tuple in minimal group */
+ 			if (nTuples == MIN_GROUP_SIZE - 1)
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 			nTuples++;
+ 		}
+ 		else
+ 		{
+ 			/* Iterate while skip cols are the same as in the saved tuple */
+ 			bool	cmp;
+ 			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+ 
+ 			if (cmp)
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, slot);
+ 				nTuples++;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 				break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 	if (node->shared_info && node->am_worker)
+ 	{
+ 		TuplesortInstrumentation *si;
+ 
+ 		Assert(IsParallelWorker());
+ 		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ 		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ 		tuplesort_get_stats(tuplesortstate, si);
+ 		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ 															node->groupsCount;
+ 	}
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current bucket in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->sampleSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->sampleSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * If subnode is to be rescanned then we forget previous sort results; we
+ 	 * have to re-read the subplan and re-sort.  Also must re-sort if the
+ 	 * bounded-sort parameters changed or we didn't select randomAccess.
+ 	 *
+ 	 * Otherwise we can just rewind and rescan the sorted output.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *						Parallel Query Support
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortEstimate
+  *
+  *		Estimate space required to propagate sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ 	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ 	shm_toc_estimate_chunk(&pcxt->estimator, size);
+ 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeDSM
+  *
+  *		Initialize DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+ 	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ 	/* ensure any unfilled slots will contain zeroes */
+ 	memset(node->shared_info, 0, size);
+ 	node->shared_info->num_workers = pcxt->nworkers;
+ 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ 				   node->shared_info);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortReInitializeDSM
+  *
+  *		Reset shared state before beginning a fresh scan.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	/* If there's any instrumentation space, clear it for next time */
+ 	if (node->shared_info != NULL)
+ 	{
+ 		memset(node->shared_info->sinfo, 0,
+ 			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ 	}
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeWorker
+  *
+  *		Attach worker to DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc)
+ {
+ 	node->shared_info =
+ 		shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id, true);
+ 	node->am_worker = true;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortRetrieveInstrumentation
+  *
+  *		Transfer sort statistics from DSM to private memory.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ 	Size		size;
+ 	SharedIncrementalSortInfo *si;
+ 
+ 	if (node->shared_info == NULL)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ 	si = palloc(size);
+ 	memcpy(si, node->shared_info, size);
+ 	node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 98bcaeb..2bddf63
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 76e7545..a0061a6
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 917,922 ****
--- 917,940 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 927,939 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 945,973 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4801,4806 ****
--- 4835,4843 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index dc35df9..a6709c9
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 866,877 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 866,875 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 894,899 ****
--- 892,915 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3734,3739 ****
--- 3750,3758 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 593658d..9e8476a
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2059,2071 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2632,2637 ****
--- 2659,2666 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 15))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 906d08a..28f2b74
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3459,3464 ****
--- 3459,3468 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 2d2df60..e56b3a2
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1601,1606 ****
--- 1602,1614 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when the input is already presorted by a prefix of the required pathkeys.
+  * In the latter case we estimate the number of groups the input is divided
+  * into by the presorted pathkeys, then estimate the cost of sorting each
+  * individual group, assuming tuples are distributed uniformly among groups.
+  * Also, if a LIMIT is specified, we only have to fetch and sort some groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1627,1633 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1642 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1643,1661 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1652,1679 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1699,1748 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the presorted keys divide the dataset into.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group of tuples whose
! 	 * presorted keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1752,1758 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1763,1772 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1774,1806 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, assume the logarithmic part
! 		 * of the estimate is 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can return any output.
+ 	 * Sorting the remaining groups is required to return all the other
+ 	 * tuples; under a LIMIT only a fraction of the groups needs to be sorted.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1811,1829 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to pay the cost of
+ 		 * detecting sort groups: an extra copy and comparison per tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2631,2638 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2659,2666 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
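
To make the cost model above concrete, here is a minimal standalone sketch of
how the per-group estimate combines into startup and run cost.  It is
illustrative only (the function name and pared-down signature are mine, not
the patch's), it covers just the in-memory quicksort case without a LIMIT,
and the real cost_sort() additionally handles disk-based and bounded sorts:

#include <math.h>

/*
 * Illustrative sketch of the incremental sort cost model: quicksort-only,
 * no LIMIT.  num_groups is the estimated number of groups with equal
 * presorted keys; input_run_cost is the run cost of reading the input.
 */
static void
sketch_incremental_sort_cost(double tuples, double num_groups,
							 double comparison_cost, double input_run_cost,
							 double *startup_cost, double *run_cost)
{
	double		group_tuples = tuples / num_groups;
	double		group_cost;

	/* about N log2 N comparisons per group; degenerate for tiny groups */
	if (group_tuples >= 2.0)
		group_cost = comparison_cost * group_tuples * log2(group_tuples);
	else
		group_cost = comparison_cost * group_tuples;

	/* each group also pays its share of fetching the input */
	group_cost += input_run_cost / num_groups;

	/* first group is sorted at startup, the remaining ones during the run */
	*startup_cost += group_cost;
	*run_cost += (num_groups - 1.0) * group_cost;
}
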
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..30a755c
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** get_cheapest_path_for_pathkeys(List *pat
*** 373,380 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 402,413 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are satisfied only partially, an incremental sort is
!  *	  needed to satisfy them completely.  Since incremental sort consumes
!  *	  its input by presorted groups, it has to read more data than a fully
!  *	  presorted path would.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1521,1562 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys matching the requested ordering.  The
!  * remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering, because the
! 		 * remaining keys can be handled by an incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1572,1578 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
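
For intuition: pathkeys_common() is a plain longest-common-prefix computation,
and pathkeys_useful_for_ordering() merely decides whether a partial prefix
counts.  For instance, a path sorted by (a, b) matched against a requested
ordering of (a, b, c) yields a common prefix of length 2, so an incremental
sort only has to sort by c within each group of equal (a, b).  An array-based
analogue of the prefix computation (illustrative; ints stand in for the
canonical PathKey pointers that the real code compares by identity):

/*
 * Illustrative analogue of pathkeys_common(): PathKeys are canonical in
 * the planner, so pointer equality suffices; ints stand in for them here.
 */
static int
common_prefix_len(const int *keys1, int n1, const int *keys2, int n2)
{
	int			n = 0;

	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	return n;
}
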
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 9c74e39..71b2b4a
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1157,1167 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1673,1683 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   NULL, n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1921,1928 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
  	 */
  	if (best_path->outersortkeys)
  	{
  		Relids		outer_relids = outer_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys,
! 												   outer_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3862,3876 ----
  	 */
  	if (best_path->outersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		outer_relids = outer_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   outer_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
  
  	if (best_path->innersortkeys)
  	{
  		Relids		inner_relids = inner_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys,
! 												   inner_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3881,3895 ----
  
  	if (best_path->innersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		inner_relids = inner_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   inner_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4915,4921 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4941,4948 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5531,5561 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
! 
! 	/* Always use regular sort node when enable_incrementalsort = false */
! 	if (!enable_incrementalsort)
! 		skipCols = 0;
! 
! 	if (skipCols == 0)
! 	{
! 		node = makeNode(Sort);
! 	}
! 	else
! 	{
! 		IncrementalSort    *incrementalSort;
! 
! 		incrementalSort = makeNode(IncrementalSort);
! 		node = &incrementalSort->sort;
! 		incrementalSort->skipCols = skipCols;
! 	}
  
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5845,5851 ****
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5890,5897 ----
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5865,5871 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5911,5917 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5954,5960 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5975,5982 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 6009,6015 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6618,6623 ****
--- 6665,6671 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
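
Since IncrementalSort embeds Sort as its first member, make_sort() can
allocate either node and fill the shared fields through a single Sort
pointer.  A reduced, self-contained sketch of that idiom (mock structs with
one field each; the real nodes carry many more):

#include <stdlib.h>

typedef struct Sort
{
	int			numCols;		/* shared sort fields live here */
} Sort;

typedef struct IncrementalSort
{
	Sort		sort;			/* inherited Sort must come first */
	int			skipCols;		/* presorted leading columns to skip */
} IncrementalSort;

/* Sketch of the dispatch: plain Sort unless some columns are presorted. */
static Sort *
sketch_make_sort(int numCols, int skipCols)
{
	Sort	   *node;

	if (skipCols == 0)
		node = calloc(1, sizeof(Sort));
	else
	{
		IncrementalSort *inc = calloc(1, sizeof(IncrementalSort));

		inc->skipCols = skipCols;
		node = &inc->sort;
	}
	node->numCols = numCols;	/* shared fields set through Sort pointer */
	return node;
}
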
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 90fd9cc..ce2acac
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3843,3856 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3843,3856 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3923,3936 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3923,3936 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4997,5009 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4997,5009 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6133,6140 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6133,6141 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index fa9a3f0..407568a
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 638,643 ****
--- 638,644 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 68dee0f..1c2b815
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 103,109 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1564,1570 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1567,1574 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1662 ****
--- 1661,1667 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost		input_startup_cost = 0;
  	Cost		input_total_cost = 0;
+ 	int			n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1679 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1678,1686 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1687,1692 ****
--- 1694,1701 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2543,2551 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2552,2582 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2559,2565 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2590,2598 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2871,2877 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2904,2911 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 291,298 ----
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed,
! 												   false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 4bbb4a8..d9c3243
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3650,3655 ****
--- 3650,3691 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+  * 							  divided into by pathkeys.
+  *
+  * Returns an array of group counts: element i is the number of groups the
+  * first i+1 pathkeys divide the dataset into.  This is a convenience
+  * wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucket statistics when the specified expression is used
   * as a hash key for the given number of buckets.
   *
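
Because estimate_pathkeys_groups() evaluates estimate_num_groups() once per
pathkey prefix, its result is non-decreasing in i.  A hypothetical consumer
of such an array could pick the longest presorted prefix that still leaves
reasonably large groups, along these lines (illustrative helper only; the
patch itself feeds the group estimates into cost_sort()):

/*
 * Hypothetical consumer of a per-prefix group-count array: choose the
 * longest prefix that still leaves at least min_group_tuples per group.
 * groups[i] is the group count for the first i+1 pathkeys.
 */
static int
choose_presorted_prefix(const double *groups, int nkeys,
						double tuples, double min_group_tuples)
{
	int			i;

	for (i = nkeys; i > 0; i--)
	{
		if (tuples / groups[i - 1] >= min_group_tuples)
			return i;
	}
	return 0;					/* fall back to a full sort */
}
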
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index c4c1afa..d9195ef
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 34af8d6..a92b477
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied by any
+ 								   single sort batch, in-memory or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace refers to on-disk
+ 								   space, false when it refers to in-memory
+ 								   space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;	/* context surviving tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The contents
+ 	 * of this context are freed by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_SIZES);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 657,663 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 716,729 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 765,771 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 796,802 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 887,893 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 962,968 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1004,1010 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1179,1276 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This avoids recreating the tuplesort (and saves
!  *	resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2950,2967 ****
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	if (state->tapeset)
! 	{
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
- 		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3060,3074 ----
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
  	else
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
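
The maincontext/sortcontext split exists precisely so tuplesort_reset() can
throw away per-batch data while keeping the sort setup and its long-lived
allocations.  A toy analogue of that pattern in plain C (not the tuplesort
API; a reusable buffer stands in for maincontext data, the element count for
per-batch sortcontext state):

#include <stdlib.h>

typedef struct GroupSorter
{
	int		   *vals;			/* survives reset, like maincontext data */
	int			cap;
	int			n;				/* per-batch state, like sortcontext data */
} GroupSorter;

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Sort the current batch in place. */
static void
group_sorter_sort(GroupSorter *gs)
{
	qsort(gs->vals, gs->n, sizeof(int), cmp_int);
}

/* Cheap per-group reset: no free/alloc, just forget the batch contents. */
static void
group_sorter_reset(GroupSorter *gs)
{
	gs->n = 0;
}
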
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...cfe944f
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *	  Prototypes for the incremental sort executor node.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, shm_toc *toc);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset could already be
+  *	 presorted by some prefix of those keys.  We call them "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 Shared memory container for per-worker sort information
   * ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
  	SharedSortInfo *shared_info;	/* one entry per worker */
  } SortState;
  
+ /* ----------------
+  *	 Shared memory container for per-worker incremental sort information
+  * ----------------
+  */
+ typedef struct IncrementalSortInfo
+ {
+ 	TuplesortInstrumentation	sinstrument;
+ 	int64						groupsCount;
+ } IncrementalSortInfo;
+ 
+ typedef struct SharedIncrementalSortInfo
+ {
+ 	int							num_workers;
+ 	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+ 
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+ 	bool		am_worker;		/* are we a worker? */
+ 	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index ffeeb49..4b78045
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index a127682..2e3e5f2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 9e68e65..f0a37e5
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 104,112 ----
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern void estimate_hash_bucket_stats(PlannerInfo *root,
  						   Node *hashkey, double nbuckets,
  						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 90,97 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					TuplesortInstrumentation *stats);
  extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index c698faf..fec6a4e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
   enable_gathermerge         | on
   enable_hashagg             | on
   enable_hashjoin            | on
+  enable_incrementalsort     | on
   enable_indexonlyscan       | on
   enable_indexscan           | on
   enable_material            | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 86,92 ----
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (14 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index 169d0dc..558246b
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
results_100m_partial.csvtext/csv; charset=US-ASCII; name=results_100m_partial.csvDownload
#35Antonin Houska
ah@cybertec.at
In reply to: Antonin Houska (#33)
Re: [HACKERS] [PATCH] Incremental sort

Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Patch rebased to current master is attached. I'm going to improve my testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?

I could find the matching HEAD quite easily (9b6cb46), so following are my comments:

* cost_sort()

** "presorted_keys" missing in the prologue

** when called from label_sort_with_costsize(), 0 is passed for
"presorted_keys". However label_sort_with_costsize() can sometimes be
called on an IncrementalSort, in which case there are some "presorted
keys". See create_mergejoin_plan() for example. (IIUC this should only
make EXPLAIN inaccurate, but should not cause incorrect decisions.)

** instead of

if (!enable_incrementalsort)
	presorted_keys = false;

you probably meant

if (!enable_incrementalsort)
	presorted_keys = 0;

** instead of

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
	PathKey *key = (PathKey *)lfirst(l);
	EquivalenceMember *member = (EquivalenceMember *)
		lfirst(list_head(key->pk_eclass->ec_members));

you can use linitial():

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
	PathKey *key = (PathKey *)lfirst(l);
	EquivalenceMember *member = (EquivalenceMember *)
		linitial(key->pk_eclass->ec_members);

* get_cheapest_fractional_path_for_pathkeys()

The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.

* pathkeys_useful_for_ordering()

Extra whitespace following the comment opening string "/*":

/*
* When incremental sort is disabled, pathkeys are useful only when they

* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned in
the prologue.

* create_sort_plan()

I noticed that pathkeys_common() is called, but the value of n_common_pathkeys
should already be in the path's "skipCols" field if the underlying path is
actually IncrementalSortPath.

* create_unique_plan() does not seem to make use of the incremental
sort. Shouldn't it do?

* nodeIncrementalSort.c

** These comments seem to contain typos:

"Incremental sort algorithm would sort by xfollowing groups, which have ..."

"Interate while skip cols are same as in saved tuple"

** (This is rather a pedantic comment) I think prepareSkipCols() should be
defined in front of cmpSortSkipCols().

** the MIN_GROUP_SIZE constant deserves a comment.

* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

node->groupsCount++;

I suggest it to be moved outside the "if" construct.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#36Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Alexander Korotkov (#34)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Sure, please find rebased patch attached.

+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+                                                             TupleTableSlot *b)
+ {
+     int n, i;
+
+     Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+     n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+     for (i = 0; i < n; i++)
+     {
+         Datum datumA, datumB, result;
+         bool isnullA, isnullB;
+         AttrNumber attno = node->skipKeys[i].attno;
+         SkipKeyData *key;
+
+         datumA = slot_getattr(a, attno, &isnullA);
+         datumB = slot_getattr(b, attno, &isnullB);
+
+         /* Special case for NULL-vs-NULL, else use standard comparison */
+         if (isnullA || isnullB)
+         {
+             if (isnullA == isnullB)
+                 continue;
+             else
+                 return false;
+         }
+
+         key = &node->skipKeys[i];
+
+         key->fcinfo.arg[0] = datumA;
+         key->fcinfo.arg[1] = datumB;
+
+         /* just for paranoia's sake, we reset isnull each time */
+         key->fcinfo.isnull = false;
+
+         result = FunctionCallInvoke(&key->fcinfo);
+
+         /* Check for null result, since caller is clearly not expecting one */
+         if (key->fcinfo.isnull)
+             elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+         if (!DatumGetBool(result))
+             return false;
+     }
+     return true;
+ }

Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?
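
For what it's worth, here's a minimal sketch of what I mean, assuming
skipKeys were changed to an array of SortSupportData entries prepared
with PrepareSortSupportFromOrderingOp() -- that layout is my assumption,
not what the patch does today:

static bool
cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
                TupleTableSlot *b)
{
    int         n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
    int         i;

    for (i = 0; i < n; i++)
    {
        /* hypothetical: skipKeys as a SortSupportData array */
        SortSupport ssup = &node->skipKeys[i];
        bool        isnullA,
                    isnullB;
        Datum       datumA = slot_getattr(a, ssup->ssup_attno, &isnullA);
        Datum       datumB = slot_getattr(b, ssup->ssup_attno, &isnullB);

        /*
         * ApplySortComparator() treats two NULLs as equal and picks up
         * any sortsupport fast comparator, so the explicit NULL handling
         * and equality-operator lookup above would go away.
         */
        if (ApplySortComparator(datumA, isnullA, datumB, isnullB, ssup) != 0)
            return false;
    }
    return true;
}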

Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently[1]. I hadn't got very far and
will now abandon that, but that's one thing I did differently. Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that). I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?
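
To make the shape of my prototype concrete, here's a toy standalone
program (illustration only, not executor code) that applies the same idea
to the example dataset from the patch's header comment: each run of equal
x is sorted by y on its own, and singleton runs skip the sort entirely.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } Pair;

static int cmp_y(const void *a, const void *b)
{
    return ((const Pair *) a)->y - ((const Pair *) b)->y;
}

int main(void)
{
    /* input presorted by x, to be sorted by (x, y) */
    Pair in[] = {{1,5},{1,2},{2,10},{2,1},{2,5},{3,3},{3,7}};
    int n = sizeof(in) / sizeof(in[0]);
    int start = 0;

    for (int i = 1; i <= n; i++)
    {
        /* group boundary: end of input, or x changes */
        if (i == n || in[i].x != in[start].x)
        {
            if (i - start > 1)      /* singleton groups need no sort */
                qsort(&in[start], i - start, sizeof(Pair), cmp_y);
            start = i;
        }
    }
    for (int i = 0; i < n; i++)
        printf("(%d, %d)\n", in[i].x, in[i].y);
    return 0;
}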

[1]: https://github.com/macdice/postgres/commit/ab0f8aff9c4b25d5598aa6b3c630df864302f572

--
Thomas Munro
http://www.enterprisedb.com

#37Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Antonin Houska (#35)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

Thank you very much for the review. I really appreciate that this topic is
getting attention. Please find the next revision of the patch attached.

On Wed, Nov 15, 2017 at 7:20 PM, Antonin Houska <ah@cybertec.at> wrote:

Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Patch rebased to current master is attached. I'm going to improve my
testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to
apply. Can you please rebase it again?

I could find the matching HEAD quite easily (9b6cb46), so following are my
comments:

* cost_sort()

** "presorted_keys" missing in the prologue

Comment is added.

** when called from label_sort_with_costsize(), 0 is passed for
"presorted_keys". However label_sort_with_costsize() can sometimes be
called on an IncrementalSort, in which case there are some "presorted
keys". See create_mergejoin_plan() for example. (IIUC this should only
make EXPLAIN inaccurate, but should not cause incorrect decisions.)

Good catch. Fixed.

** instead of

if (!enable_incrementalsort)
	presorted_keys = false;

you probably meant

if (!enable_incrementalsort)
	presorted_keys = 0;

Absolutely correct. Fixed.

** instead of

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
	PathKey *key = (PathKey *)lfirst(l);
	EquivalenceMember *member = (EquivalenceMember *)
		lfirst(list_head(key->pk_eclass->ec_members));

you can use linitial():

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
	PathKey *key = (PathKey *)lfirst(l);
	EquivalenceMember *member = (EquivalenceMember *)
		linitial(key->pk_eclass->ec_members);

Sure. Fixed.

* get_cheapest_fractional_path_for_pathkeys()

The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.

Good catch. This is part of an optimization for build_minmax_path() which
existed in an earlier version of the patch. That optimization contained a
set of arguable solutions, which is why I removed it from the patch and
let it wait until the initial implementation is committed.

* pathkeys_useful_for_ordering()

Extra whitespace following the comment opening string "/*":

/*
* When incremental sort is disabled, pathkeys are useful only when they

Fixed.

* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned
in the prologue.

Comment is added.

* create_sort_plan()

I noticed that pathkeys_common() is called, but the value of
n_common_pathkeys should already be in the path's "skipCols" field if the
underlying path is actually IncrementalSortPath.

Sounds like reasonable optimization. Done.

* create_unique_plan() does not seem to make use of the incremental
sort. Shouldn't it do?

It definitely should. But a proper solution doesn't seem easy to me. We
would have to construct the potentially useful paths beforehand, and that
should be done in a manner agnostic to the order of pathkeys. I'm afraid
of possible regressions in query planning. Therefore, it seems like a
topic for a separate discussion. I would prefer to commit some basic
implementation first and then consider smaller patches with possible
enhancements, including this one.

* nodeIncrementalSort.c

** These comments seem to contain typos:

"Incremental sort algorithm would sort by xfollowing groups, which have
..."

"Interate while skip cols are same as in saved tuple"

Fixed.

** (This is rather a pedantic comment) I think prepareSkipCols() should be
defined in front of cmpSortSkipCols().

That's a good comment. We're trying to be as pedantic about the code as we
can :)
Fixed.

** the MIN_GROUP_SIZE constant deserves a comment.

Sure. Explanation was added.

* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

node->groupsCount++;

I suggest it to be moved outside the "if" construct.

Done.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-10.patchapplication/octet-stream; name=incremental-sort-10.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 4339bbf..df72ab1
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1981,2019 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index ddfec79..c8c6fb7
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index fc1752f..291360f
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3552,3557 ****
--- 3552,3571 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 447f69d..a646d82
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1011,1016 ****
--- 1015,1023 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1611,1616 ****
--- 1618,1629 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1936,1950 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1949,1986 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1954,1960 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1990,1996 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1978,1984 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 2014,2020 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2047,2053 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2083,2089 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2104,2110 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2140,2146 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2117,2129 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2153,2166 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2163,2171 ****
--- 2200,2212 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2374,2379 ****
--- 2415,2509 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		TuplesortInstrumentation stats;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &stats);
+ 		sortMethod = tuplesort_method_name(stats.sortMethod);
+ 		spaceType = tuplesort_space_type_name(stats.spaceType);
+ 		spaceUsed = stats.spaceUsed;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ 
+ 	if (incrsortstate->shared_info != NULL)
+ 	{
+ 		int			n;
+ 		bool		opened_group = false;
+ 
+ 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ 		{
+ 			TuplesortInstrumentation *sinstrument;
+ 			const char *sortMethod;
+ 			const char *spaceType;
+ 			long		spaceUsed;
+ 			int64		groupsCount;
+ 
+ 			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ 			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ 			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ 				continue;		/* ignore any unfilled slots */
+ 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ 			spaceUsed = sinstrument->spaceUsed;
+ 
+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str,
+ 								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+ 								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+ 			}
+ 			else
+ 			{
+ 				if (!opened_group)
+ 				{
+ 					ExplainOpenGroup("Workers", "Workers", false, es);
+ 					opened_group = true;
+ 				}
+ 				ExplainOpenGroup("Worker", NULL, true, es);
+ 				ExplainPropertyInteger("Worker Number", n, es);
+ 				ExplainPropertyText("Sort Method", sortMethod, es);
+ 				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 				ExplainPropertyText("Sort Space Type", spaceType, es);
+ 				ExplainPropertyLong("Sort Groups", groupsCount, es);
+ 				ExplainCloseGroup("Worker", NULL, true, es);
+ 			}
+ 		}
+ 		if (opened_group)
+ 			ExplainCloseGroup("Workers", "Workers", false, es);
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index cc09895..572aca0
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 53c5254..f3d6876
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "executor/nodeBitmapHeapscan.h"
  #include "executor/nodeCustom.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 263,268 ****
--- 264,273 ----
  			/* even when not parallel-aware */
  			ExecSortEstimate((SortState *) planstate, e->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 462,467 ****
--- 467,476 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 876,881 ****
--- 885,894 ----
  			/* even when not parallel-aware */
  			ExecSortReInitializeDSM((SortState *) planstate, pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 934,939 ****
--- 947,954 ----
  	 */
  	if (IsA(planstate, SortState))
  		ExecSortRetrieveInstrumentation((SortState *) planstate);
+ 	else if (IsA(planstate, IncrementalSortState))
+ 		ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
  
  	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
  								 instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 1164,1169 ****
--- 1179,1189 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ 												pwcxt);
+ 			break;
  
  		default:
  			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index d26ce08..3c37bda
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 754,760 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...1a1e48f
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,649 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required keys
+  *		list.  Thus, when it's required to sort by (key1, key2 ... keyN) and
+  *		the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+  *		groups where the values of (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y), already presorted by x, while it's required to
+  *		sort them by x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would sort the following groups,
+  *		which have equal x, by y individually:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and putting them together, we get the
+  *		following tuple set, which is sorted by x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than a full sort on large datasets.  It
+  *		benefits most in queries with LIMIT, because incremental sort can
+  *		return the first tuples without reading the whole input dataset.
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Copying tuples to node->sampleSlot introduces some overhead.  It's
+  * especially noticeable when groups contain only one or a few tuples.  To
+  * cope with this problem, we don't copy the sample tuple until the group
+  * contains at least MIN_GROUP_SIZE tuples.  This might reduce the
+  * efficiency of incremental sort, but it reduces the probability of
+  * regression.
+  */
+ #define MIN_GROUP_SIZE 32
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.  It
+  *		fetches groups of tuples where the prefix sort columns are equal and
+  *		sorts them using tuplesort.  This approach avoids sorting the whole
+  *		dataset.  Besides taking less memory and being faster, it allows us
+  *		to start returning tuples before fetching the full dataset from the
+  *		outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * If this is the first time through, or the current sorted group has
+ 	 * been exhausted, read the next group of tuples from the outer plan and
+ 	 * pass them to tuplesort.c.  Subsequent calls just fetch tuples from the
+ 	 * tuplesort.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize the support structures used by cmpSortSkipCols() to
+ 		 * compare the already-sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We feed the tuplesort groups
+ 		 * of at least MIN_GROUP_SIZE tuples, so these groups don't
+ 		 * necessarily have equal values in the first column(s).  Since we
+ 		 * are unlikely to see huge groups with incremental sort, using
+ 		 * abbreviated keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 	}
+ 	node->groupsCount++;
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/* Put the saved tuple, if any, into the tuplesort */
+ 	if (!TupIsNull(node->sampleSlot))
+ 	{
+ 		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ 		ExecClearTuple(node->sampleSlot);
+ 		nTuples++;
+ 	}
+ 
+ 	/*
+ 	 * Feed the tuplesort the next group of tuples, i.e. those whose
+ 	 * skipCols sort values are all equal.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (TupIsNull(slot))
+ 		{
+ 			node->finished = true;
+ 			break;
+ 		}
+ 
+ 		/* Accept the first MIN_GROUP_SIZE tuples unconditionally */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 
+ 			/* Save the last tuple of the minimal group as the comparison sample */
+ 			if (nTuples == MIN_GROUP_SIZE - 1)
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 			nTuples++;
+ 		}
+ 		else
+ 		{
+ 			/* Check whether the skip cols match those of the saved tuple */
+ 			bool	cmp;
+ 			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+ 
+ 			if (cmp)
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, slot);
+ 				nTuples++;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 				break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 	if (node->shared_info && node->am_worker)
+ 	{
+ 		TuplesortInstrumentation *si;
+ 
+ 		Assert(IsParallelWorker());
+ 		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ 		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ 		tuplesort_get_stats(tuplesortstate, si);
+ 		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ 															node->groupsCount;
+ 	}
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group of tuples in the tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->sampleSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->sampleSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * We always forget the previous sort results; since incremental sort
+ 	 * doesn't support random access to its output, we have to re-read the
+ 	 * subplan and re-sort from scratch.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *						Parallel Query Support
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortEstimate
+  *
+  *		Estimate space required to propagate sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ 	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ 	shm_toc_estimate_chunk(&pcxt->estimator, size);
+ 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeDSM
+  *
+  *		Initialize DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+ 	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ 	/* ensure any unfilled slots will contain zeroes */
+ 	memset(node->shared_info, 0, size);
+ 	node->shared_info->num_workers = pcxt->nworkers;
+ 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ 				   node->shared_info);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortReInitializeDSM
+  *
+  *		Reset shared state before beginning a fresh scan.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	/* If there's any instrumentation space, clear it for next time */
+ 	if (node->shared_info != NULL)
+ 	{
+ 		memset(node->shared_info->sinfo, 0,
+ 			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ 	}
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeWorker
+  *
+  *		Attach worker to DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+ {
+ 	node->shared_info =
+ 		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ 	node->am_worker = true;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortRetrieveInstrumentation
+  *
+  *		Transfer sort statistics from DSM to private memory.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ 	Size		size;
+ 	SharedIncrementalSortInfo *si;
+ 
+ 	if (node->shared_info == NULL)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ 	si = palloc(size);
+ 	memcpy(si, node->shared_info, size);
+ 	node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 73aa371..ef3587c
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index d9ff8a7..417a8d2
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 919,924 ****
--- 919,942 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 929,941 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 947,975 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4803,4808 ****
--- 4837,4845 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index c97ee24..6cb9300
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 869,880 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 869,878 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 897,902 ****
--- 895,918 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3737,3742 ****
--- 3753,3761 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 7eb67fc..f2b0e75
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2059,2071 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2634,2639 ****
--- 2661,2668 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 906d08a..28f2b74
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3459,3464 ****
--- 3459,3468 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d11bf19..2f7cf60
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1601,1606 ****
--- 1602,1614 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when the data is already presorted by some of the required pathkeys.  In
+  * the latter case we estimate the number of groups into which the presorted
+  * pathkeys divide the source data, and then estimate the cost of sorting
+  * each individual group, assuming the data is divided among the groups
+  * uniformly.  Also, if a LIMIT is specified, we only have to pull from the
+  * source and sort some of the groups.
+  *
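+  * Roughly, the incremental sort cost computed below is num_groups times the
+  * cost of sorting one group of (tuples / num_groups) tuples, plus a
+  * per-tuple group-detection charge and a per-group tuplesort reset charge.
+  *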
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1627,1633 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1643 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'presorted_keys' is the number of leading pathkeys the input is presorted by
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1643,1661 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1653,1680 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1700,1749 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups into which the presorted keys divide
! 	 * the dataset.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 										linitial(key->pk_eclass->ec_members);
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group within which the
! 	 * presorted keys are all equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1753,1759 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1764,1773 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1775,1807 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * fewer than two tuples per sort group, take the logarithmic part
! 		 * of the estimate to be 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can start producing
+ 	 * output; sorting the remaining groups is required to return all the
+ 	 * other tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
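+ 
+ 	/*
+ 	 * For example, with num_groups = 10 and no LIMIT (so output_tuples ==
+ 	 * tuples), startup_cost absorbs the cost of sorting the first group and
+ 	 * run_cost absorbs the cost of the remaining nine.
+ 	 */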
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1812,1830 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we must also account for detecting
+ 		 * sort group boundaries, which costs an extra copy and comparison
+ 		 * for each tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2632,2639 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2660,2667 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..b97f22a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
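+  *    For example, the common prefix of pathkeys (a, b, c) and (a, b, d)
+  *    has length 2.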
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1517,1558 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the given query_pathkeys.
!  * The remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of path keys in common, or 0 if there are none. Any
! 		 * first common pathkeys could be useful for ordering because we can use
! 		 * incremental sort.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only if
! 		 * they contain all of the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1568,1574 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index d445477..b080fa6
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1157,1167 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1673,1685 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	if (IsA(best_path, IncrementalSortPath))
! 		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
! 	else
! 		n_common_pathkeys = 0;
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   NULL, n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1923,1930 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
  	 */
  	if (best_path->outersortkeys)
  	{
  		Relids		outer_relids = outer_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys,
! 												   outer_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3864,3878 ----
  	 */
  	if (best_path->outersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		outer_relids = outer_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   outer_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
  
  	if (best_path->innersortkeys)
  	{
  		Relids		inner_relids = inner_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys,
! 												   inner_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3883,3897 ----
  
  	if (best_path->innersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		inner_relids = inner_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   inner_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4914,4921 ****
  {
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4942,4954 ----
  {
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
+ 	int			skip_cols = 0;
  
! 	if (IsA(plan, IncrementalSort))
! 		skip_cols = ((IncrementalSort *) plan)->skipCols;
! 
! 	cost_sort(&sort_path, root, NIL, skip_cols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5537,5567 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
! 
! 	/* Always use regular sort node when enable_incrementalsort = false */
! 	if (!enable_incrementalsort)
! 		skipCols = 0;
  
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5843,5851 ****
   *	  'lefttree' is the node which yields input tuples
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5894,5904 ----
   *	  'lefttree' is the node which yields input tuples
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+  *	  'skipCols' is the number of presorted columns in input tuples
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5865,5871 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5918,5924 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5961,5967 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5982,5989 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 6016,6022 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6619,6624 ****
--- 6673,6679 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index f6b8bbf..a7955e5
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3852,3865 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3852,3865 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3932,3945 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3932,3945 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 5006,5018 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 5006,5018 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6142,6149 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6142,6150 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index 28a7f7e..90df9cc
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 642,647 ****
--- 642,648 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 68dee0f..1c2b815
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 103,109 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1564,1570 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1567,1574 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1657,1662 ****
--- 1661,1667 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost		input_startup_cost = 0;
  	Cost		input_total_cost = 0;
+ 	int			n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1673,1679 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1678,1686 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1687,1692 ****
--- 1694,1701 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2543,2551 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2552,2582 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2559,2565 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2590,2598 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2871,2877 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2904,2911 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 291,298 ----
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed,
! 												   false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 4bbb4a8..d9c3243
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3650,3655 ****
--- 3650,3691 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+  * 							  is divided into by pathkeys.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * the first i pathkeys divide the dataset into.  This is really a convenience
+  * wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucket statistics when the specified expression is used
   * as a hash key for the given number of buckets.
   *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 6dcd738..192d3c8
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 34af8d6..a92b477
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among sorts
+ 								   of groups, either in-memory or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+ 								   space, false when it's a value for in-memory
+ 								   space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 657,663 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 716,729 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 765,771 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 796,802 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 887,893 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 962,968 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1004,1010 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1179,1276 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This allows us to avoid recreating the tuplesort (and
!  *	thus save resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2950,2967 ****
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	if (state->tapeset)
! 	{
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
- 		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3060,3074 ----
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
  	else
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...b2e4e50
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of these keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 Shared memory container for per-worker sort information
   * ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
  	SharedSortInfo *shared_info;	/* one entry per worker */
  } SortState;
  
+ /* ----------------
+  *	 Shared memory container for per-worker incremental sort information
+  * ----------------
+  */
+ typedef struct IncrementalSortInfo
+ {
+ 	TuplesortInstrumentation	sinstrument;
+ 	int64						groupsCount;
+ } IncrementalSortInfo;
+ 
+ typedef struct SharedIncrementalSortInfo
+ {
+ 	int							num_workers;
+ 	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+ 
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* is fetching tuples from the outer
+ 								   node finished? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+ 	bool		am_worker;		/* are we a worker? */
+ 	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index ffeeb49..4b78045
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 9b38d44..0694fb2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 9e68e65..f0a37e5
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 104,112 ----
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern void estimate_hash_bucket_stats(PlannerInfo *root,
  						   Node *hashkey, double nbuckets,
  						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 90,97 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					TuplesortInstrumentation *stats);
  extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index c698faf..fec6a4e
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
   enable_gathermerge         | on
   enable_hashagg             | on
   enable_hashjoin            | on
+  enable_incrementalsort     | on
   enable_indexonlyscan       | on
   enable_indexscan           | on
   enable_material            | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 86,92 ----
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (14 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index 169d0dc..558246b
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#38Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Thomas Munro (#36)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Sure, please find rebased patch attached.

+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+
TupleTableSlot *b)
+ {
+     int n, i;
+
+     Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+     n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+     for (i = 0; i < n; i++)
+     {
+         Datum datumA, datumB, result;
+         bool isnullA, isnullB;
+         AttrNumber attno = node->skipKeys[i].attno;
+         SkipKeyData *key;
+
+         datumA = slot_getattr(a, attno, &isnullA);
+         datumB = slot_getattr(b, attno, &isnullB);
+
+         /* Special case for NULL-vs-NULL, else use standard comparison */
+         if (isnullA || isnullB)
+         {
+             if (isnullA == isnullB)
+                 continue;
+             else
+                 return false;
+         }
+
+         key = &node->skipKeys[i];
+
+         key->fcinfo.arg[0] = datumA;
+         key->fcinfo.arg[1] = datumB;
+
+         /* just for paranoia's sake, we reset isnull each time */
+         key->fcinfo.isnull = false;
+
+         result = FunctionCallInvoke(&key->fcinfo);
+
+         /* Check for null result, since caller is clearly not expecting one */
+         if (key->fcinfo.isnull)
+             elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+         if (!DatumGetBool(result))
+             return false;
+     }
+     return true;
+ }

Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?

However, for the incremental sort case we don't need to know here whether
A > B or B > A. It's enough for us to know whether A = B or A != B. In
some cases that's way cheaper. For instance, for texts an equality check
is basically a memcmp, while a comparison may use the collation.
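
To illustrate with a minimal sketch (not the patch's code; in the backend
the relevant routines are texteq(), which does a length check plus memcmp,
and varstr_cmp(), which consults the collation):

#include <stdbool.h>
#include <string.h>

/*
 * Sketch: equality on text can short-circuit on length and then compare
 * raw bytes; an ordering comparison has to go through the collation
 * (strcoll() and friends), which is typically far more expensive.
 */
static bool
text_eq_sketch(const char *a, size_t lena, const char *b, size_t lenb)
{
    if (lena != lenb)
        return false;               /* different lengths cannot be equal */
    return memcmp(a, b, lena) == 0; /* byte-wise, no collation involved */
}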

Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently[1]. I hadn't got very far and
will now abandon that, but that's one thing I did differently. Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that). I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?

Right. The issue is that not only the case of one tuple per group causes
overhead; a few tuples per group (like 2 or 3) is also a source of
overhead. Also, the overhead is related not only to sorting. While
investigating the regression case provided by Heikki [1], I've seen extra
time spent mostly in extra copying of the sample tuple and comparison with
it. In order to cope with this overhead I've introduced MIN_GROUP_SIZE,
which allows us to avoid copying the sample tuple too frequently.

[1]: /messages/by-id/2c59b009-61d3-9350-04ee-4b701eb93101@iki.fi

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#39Peter Geoghegan
In reply to: Alexander Korotkov (#37)
Re: [HACKERS] [PATCH] Incremental sort

On Mon, Nov 20, 2017 at 3:34 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Thank you very much for the review. I really appreciate that this topic
is getting attention. Please find the next revision of the patch attached.

I would really like to see this get into v11. This is an important
patch that has fallen through the cracks far too many times.

--
Peter Geoghegan

#40Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Alexander Korotkov (#38)
Re: [HACKERS] [PATCH] Incremental sort

On Tue, Nov 21, 2017 at 1:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Is there some reason not to use ApplySortComparator for this? I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?

However, for the incremental sort case we don't need to know here whether
A > B or B > A. It's enough for us to know whether A = B or A != B. In
some cases that's way cheaper. For instance, for texts an equality check
is basically a memcmp, while a comparison may use the collation.

Ah, right, of course.

I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?

Right. The issue is that not only the case of one tuple per group causes
overhead; a few tuples per group (like 2 or 3) is also a source of
overhead. Also, the overhead is related not only to sorting. While
investigating the regression case provided by Heikki [1], I've seen extra
time spent mostly in extra copying of the sample tuple and comparison with
it. In order to cope with this overhead I've introduced MIN_GROUP_SIZE,
which allows us to avoid copying the sample tuple too frequently.

I see. I wonder if there could ever be a function like
ExecMoveTuple(dst, src). Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership. I haven't really thought that through and
there may be fundamental problems with it...
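
For concreteness, here is a hypothetical sketch of the general (slow) case.
The name ExecMoveTuple and its behaviour are just the idea above spelled
out; nothing like it exists in the tree:

#include "executor/tuptable.h"

/*
 * Hypothetical: move a tuple from src to dst.  The general case has to
 * materialize a physical copy into dst and clear src; specialized slot
 * types could instead hand over a pointer and pin ownership.
 */
static void
ExecMoveTuple(TupleTableSlot *dst, TupleTableSlot *src)
{
    ExecCopySlot(dst, src);     /* physical copy into dst's own memory */
    ExecClearTuple(src);        /* drop src's reference to the tuple */
}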

If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could), (2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns), (3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns. Ok, that last one is probably hard but the first two
might be easier...
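
As a rough sketch of idea (2), assuming the sorter is set up per group:
with the patched tuplesort_begin_heap() signature one could pass only the
trailing sort keys, so the shared prefix is never compared (the helper
name below is invented for illustration):

/*
 * Sketch: all tuples in the sorter share the first skipCols columns, so
 * sort on the remaining suffix keys only by offsetting the key arrays.
 */
static Tuplesortstate *
begin_suffix_sort(TupleDesc tupDesc, int nkeys, int skipCols,
                  AttrNumber *attNums, Oid *sortOperators,
                  Oid *sortCollations, bool *nullsFirstFlags)
{
    Assert(skipCols < nkeys);
    return tuplesort_begin_heap(tupDesc,
                                nkeys - skipCols,   /* suffix keys only */
                                attNums + skipCols,
                                sortOperators + skipCols,
                                sortCollations + skipCols,
                                nullsFirstFlags + skipCols,
                                work_mem,
                                false,  /* no random access needed */
                                false); /* keep abbreviation on new 1st key */
}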

--
Thomas Munro
http://www.enterprisedb.com

#41Antonin Houska
ah@cybertec.at
In reply to: Alexander Korotkov (#37)
Re: [HACKERS] [PATCH] Incremental sort

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Antonin Houska <ah@cybertec.at> wrote:

* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

node->groupsCount++;

I suggest it to be moved outside the "if" construct.

Done.

One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero if the input set
turns out to be empty (not sure if it can happen in practice).

And finally one question about regression tests: what's the purpose of the
changes in contrib/postgres_fdw/sql/postgres_fdw.sql? I see no
IncrementalSort node in the output.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#42Michael Paquier
michael.paquier@gmail.com
In reply to: Alexander Korotkov (#5)
Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 20, 2017 at 6:33 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Thank you for the report.
Please find the rebased patch attached.

This patch cannot be applied. Please provide a rebased version. I am
moving it to the next CF with "waiting on author" as its status.
--
Michael

#43Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Antonin Houska (#41)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Nov 22, 2017 at 1:22 PM, Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Antonin Houska <ah@cybertec.at> wrote:

* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

node->groupsCount++;

I suggest it to be moved outside the "if" construct.

Done.

One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero if the input set
turns out to be empty (not sure if it can happen in practice).

That happens in practice. On an empty input set, incremental sort counts
exactly one group.

# create table t (x int, y int);
CREATE TABLE
# create index t_x_idx on t (x);
CREATE INDEX
# set enable_seqscan = off;
SET
# explain (analyze, buffers) select * from t order by x, y;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=0.74..161.14 rows=2260 width=8) (actual
time=0.024..0.024 rows=0 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
Sort Groups: 1
Buffers: shared hit=1
-> Index Scan using t_x_idx on t (cost=0.15..78.06 rows=2260 width=8)
(actual time=0.011..0.011 rows=0 loops=1)
Buffers: shared hit=1
Planning time: 0.088 ms
Execution time: 0.066 ms
(10 rows)

But from the perspective of how the code works, it's really 1 group. A
tuple sort was created, no tuples were inserted, then it was sorted and no
tuples came out. So, I'm not sure it's really incorrect...

And finally one question about regression tests: what's the purpose of the
changes in contrib/postgres_fdw/sql/postgres_fdw.sql? I see no
IncrementalSort node in the output.

But there is an IncrementalSort node on the remote side.
Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test
is that a cross join with ORDER BY LIMIT is not beneficial to push down,
because the LIMIT is not pushed down and the remote side wouldn't be able
to use a top-N heapsort. But if the remote side has incremental sort then
it can be used, and fetching the first 110 rows is cheap. Let's look at the
plan of the original "CROSS JOIN, not pushed down" test with incremental
sort.

# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2
t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=160.32..161.31 rows=10 width=46) (actual time=1.918..1.921
rows=10 loops=1)
Output: t1.c3, t2.c3, t1.c1, t2.c1
-> Foreign Scan (cost=150.47..66711.06 rows=675684 width=46) (actual
time=1.684..1.911 rows=110 loops=1)
Output: t1.c3, t2.c3, t1.c1, t2.c1
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
Remote SQL: SELECT r1.c3, r1."C 1", r2.c3, r2."C 1" FROM ("S 1"."T
1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS
LAST, r2."C 1" ASC NULLS LAST
Planning time: 1.370 ms
Execution time: 2.068 ms
(8 rows)

And "remote SQL" has following execution plan. This is plan of full
execution while FDW is fetching only first 110 rows out of there.

# EXPLAIN ANALYZE SELECT r1.c3, r1."C 1", r2.c3, r2."C 1" FROM ("S 1"."T 1"
r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST,
r2."C 1" ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=50.47..53097.38 rows=675684 width=34) (actual
time=1.883..747.694 rows=675684 loops=1)
Sort Key: r1."C 1", r2."C 1"
Presorted Key: r1."C 1"
Sort Method: quicksort Memory: 114kB
Sort Groups: 822
-> Nested Loop (cost=0.28..8543.25 rows=675684 width=34) (actual
time=0.027..144.070 rows=675684 loops=1)
-> Index Scan using t1_pkey on "T 1" r1 (cost=0.28..73.93
rows=822 width=17) (actual time=0.015..0.537 rows=822 loops=1)
-> Materialize (cost=0.00..25.33 rows=822 width=17) (actual
time=0.000..0.053 rows=822 loops=822)
-> Seq Scan on "T 1" r2 (cost=0.00..21.22 rows=822
width=17) (actual time=0.007..0.257 rows=822 loops=1)
Planning time: 0.109 ms
Execution time: 785.400 ms
(11 rows)

Thus, with incremental sort this test doesn't do what it was designed to
do. Changing the ORDER BY from t1.c1, t2.c1 to t1.c3, t2.c3 fixes this
problem, because there is no index on c3. The query and result are slightly
different, but this serves the original design.

Please, find rebased patch attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-11.patchapplication/octet-stream; name=incremental-sort-11.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
new file mode 100644
index 1063d92..aa4d7c0
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 1981,2019 ****
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
   Limit
!    Output: t1.c1, t2.c1
     ->  Sort
!          Output: t1.c1, t2.c1
!          Sort Key: t1.c1, t2.c1
           ->  Nested Loop
!                Output: t1.c1, t2.c1
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1
!                      Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c1
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c1
!                            Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
!  c1 | c1  
! ----+-----
!   1 | 101
!   1 | 102
!   1 | 103
!   1 | 104
!   1 | 105
!   1 | 106
!   1 | 107
!   1 | 108
!   1 | 109
!   1 | 110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
--- 1981,2019 ----
  
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!                             QUERY PLAN                            
! ------------------------------------------------------------------
   Limit
!    Output: t1.c3, t2.c3
     ->  Sort
!          Output: t1.c3, t2.c3
!          Sort Key: t1.c3, t2.c3
           ->  Nested Loop
!                Output: t1.c3, t2.c3
                 ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c3
!                      Remote SQL: SELECT c3 FROM "S 1"."T 1"
                 ->  Materialize
!                      Output: t2.c3
                       ->  Foreign Scan on public.ft2 t2
!                            Output: t2.c3
!                            Remote SQL: SELECT c3 FROM "S 1"."T 1"
  (15 rows)
  
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
!   c3   |  c3   
! -------+-------
!  00001 | 00101
!  00001 | 00102
!  00001 | 00103
!  00001 | 00104
!  00001 | 00105
!  00001 | 00106
!  00001 | 00107
!  00001 | 00108
!  00001 | 00109
!  00001 | 00110
  (10 rows)
  
  -- different server, not pushed down. No result expected.
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
new file mode 100644
index 0986957..cb46bfa
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 3060597..d0e7c4d
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** ANY <replaceable class="parameter">num_s
*** 3553,3558 ****
--- 3553,3572 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+       <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables or disables the query planner's use of incremental sort
+         steps. The default is <literal>on</>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
        <term><varname>enable_indexscan</varname> (<type>boolean</type>)
        <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 447f69d..a646d82
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_upper_qual(List *qual, 
*** 80,85 ****
--- 80,87 ----
  				ExplainState *es);
  static void show_sort_keys(SortState *sortstate, List *ancestors,
  			   ExplainState *es);
+ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 					   List *ancestors, ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_agg_keys(AggState *astate, List *ancestors,
*************** static void show_grouping_set_keys(PlanS
*** 93,99 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 95,101 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** static void show_sortorder_options(Strin
*** 101,106 ****
--- 103,110 ----
  static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
  				 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
+ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 									   ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  					ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 1011,1016 ****
--- 1015,1023 ----
  		case T_Sort:
  			pname = sname = "Sort";
  			break;
+ 		case T_IncrementalSort:
+ 			pname = sname = "Incremental Sort";
+ 			break;
  		case T_Group:
  			pname = sname = "Group";
  			break;
*************** ExplainNode(PlanState *planstate, List *
*** 1611,1616 ****
--- 1618,1629 ----
  			show_sort_keys(castNode(SortState, planstate), ancestors, es);
  			show_sort_info(castNode(SortState, planstate), es);
  			break;
+ 		case T_IncrementalSort:
+ 			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+ 									   ancestors, es);
+ 			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+ 									   es);
+ 			break;
  		case T_MergeAppend:
  			show_merge_append_keys(castNode(MergeAppendState, planstate),
  								   ancestors, es);
*************** static void
*** 1936,1950 ****
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
   * Likewise, for a MergeAppend node.
   */
  static void
--- 1949,1986 ----
  show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
  {
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+ 	int			skipCols;
+ 
+ 	if (IsA(plan, IncrementalSort))
+ 		skipCols = ((IncrementalSort *) plan)->skipCols;
+ 	else
+ 		skipCols = 0;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
  }
  
  /*
+  * Show the sort keys for an IncrementalSort node.
+  */
+ static void
+ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+ 						   List *ancestors, ExplainState *es)
+ {
+ 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+ 
+ 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+ 						 plan->sort.numCols, plan->skipCols,
+ 						 plan->sort.sortColIdx,
+ 						 plan->sort.sortOperators, plan->sort.collations,
+ 						 plan->sort.nullsFirst,
+ 						 ancestors, es);
+ }
+ 
+ /*
   * Likewise, for a MergeAppend node.
   */
  static void
*************** show_merge_append_keys(MergeAppendState 
*** 1954,1960 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1990,1996 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1978,1984 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 2014,2020 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 2047,2053 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 2083,2089 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 2104,2110 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 2140,2146 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 2117,2129 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 2153,2166 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2163,2171 ****
--- 2200,2212 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2374,2379 ****
--- 2415,2509 ----
  }
  
  /*
+  * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+  */
+ static void
+ show_incremental_sort_info(IncrementalSortState *incrsortstate,
+ 						   ExplainState *es)
+ {
+ 	if (es->analyze && incrsortstate->sort_Done &&
+ 		incrsortstate->tuplesortstate != NULL)
+ 	{
+ 		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+ 		TuplesortInstrumentation stats;
+ 		const char *sortMethod;
+ 		const char *spaceType;
+ 		long		spaceUsed;
+ 
+ 		tuplesort_get_stats(state, &stats);
+ 		sortMethod = tuplesort_method_name(stats.sortMethod);
+ 		spaceType = tuplesort_space_type_name(stats.spaceType);
+ 		spaceUsed = stats.spaceUsed;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+ 							 sortMethod, spaceType, spaceUsed);
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Sort Groups: %ld\n",
+ 							 incrsortstate->groupsCount);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyText("Sort Method", sortMethod, es);
+ 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			ExplainPropertyLong("Sort Groups: %ld",
+ 								incrsortstate->groupsCount, es);
+ 		}
+ 	}
+ 
+ 	if (incrsortstate->shared_info != NULL)
+ 	{
+ 		int			n;
+ 		bool		opened_group = false;
+ 
+ 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+ 		{
+ 			TuplesortInstrumentation *sinstrument;
+ 			const char *sortMethod;
+ 			const char *spaceType;
+ 			long		spaceUsed;
+ 			int64		groupsCount;
+ 
+ 			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+ 			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+ 			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+ 				continue;		/* ignore any unfilled slots */
+ 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+ 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+ 			spaceUsed = sinstrument->spaceUsed;
+ 
+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str,
+ 								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+ 								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+ 			}
+ 			else
+ 			{
+ 				if (!opened_group)
+ 				{
+ 					ExplainOpenGroup("Workers", "Workers", false, es);
+ 					opened_group = true;
+ 				}
+ 				ExplainOpenGroup("Worker", NULL, true, es);
+ 				ExplainPropertyInteger("Worker Number", n, es);
+ 				ExplainPropertyText("Sort Method", sortMethod, es);
+ 				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+ 				ExplainPropertyText("Sort Space Type", spaceType, es);
+ 				ExplainPropertyLong("Sort Groups", groupsCount, es);
+ 				ExplainCloseGroup("Worker", NULL, true, es);
+ 			}
+ 		}
+ 		if (opened_group)
+ 			ExplainCloseGroup("Workers", "Workers", false, es);
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
new file mode 100644
index cc09895..572aca0
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
*************** OBJS = execAmi.o execCurrent.o execExpr.
*** 24,31 ****
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
!        nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
--- 24,31 ----
         nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
!        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
!        nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
         nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index f1636a5..dd8cffe
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecReScan(PlanState *node)
*** 253,258 ****
--- 254,263 ----
  			ExecReScanSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecReScanIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecReScanGroup((GroupState *) node);
  			break;
*************** ExecSupportsBackwardScan(Plan *node)
*** 525,532 ****
--- 530,541 ----
  		case T_CteScan:
  		case T_Material:
  		case T_Sort:
+ 			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_IncrementalSort:
+ 			return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 53c5254..f3d6876
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "executor/nodeBitmapHeapscan.h"
  #include "executor/nodeCustom.h"
  #include "executor/nodeForeignscan.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeSeqscan.h"
*************** ExecParallelEstimate(PlanState *planstat
*** 263,268 ****
--- 264,273 ----
  			/* even when not parallel-aware */
  			ExecSortEstimate((SortState *) planstate, e->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelInitializeDSM(PlanState *pla
*** 462,467 ****
--- 467,476 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelReInitializeDSM(PlanState *p
*** 876,881 ****
--- 885,894 ----
  			/* even when not parallel-aware */
  			ExecSortReInitializeDSM((SortState *) planstate, pcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+ 			break;
  
  		default:
  			break;
*************** ExecParallelRetrieveInstrumentation(Plan
*** 934,939 ****
--- 947,954 ----
  	 */
  	if (IsA(planstate, SortState))
  		ExecSortRetrieveInstrumentation((SortState *) planstate);
+ 	else if (IsA(planstate, IncrementalSortState))
+ 		ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
  
  	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
  								 instrumentation);
*************** ExecParallelInitializeWorker(PlanState *
*** 1164,1169 ****
--- 1179,1189 ----
  			/* even when not parallel-aware */
  			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
  			break;
+ 		case T_IncrementalSortState:
+ 			/* even when not parallel-aware */
+ 			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+ 												pwcxt);
+ 			break;
  
  		default:
  			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
new file mode 100644
index c1aa506..e4225df
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "executor/nodeGroup.h"
  #include "executor/nodeHash.h"
  #include "executor/nodeHashjoin.h"
+ #include "executor/nodeIncrementalSort.h"
  #include "executor/nodeIndexonlyscan.h"
  #include "executor/nodeIndexscan.h"
  #include "executor/nodeLimit.h"
*************** ExecInitNode(Plan *node, EState *estate,
*** 314,319 ****
--- 315,325 ----
  												estate, eflags);
  			break;
  
+ 		case T_IncrementalSort:
+ 			result = (PlanState *) ExecInitIncrementalSort(
+ 									(IncrementalSort *) node, estate, eflags);
+ 			break;
+ 
  		case T_Group:
  			result = (PlanState *) ExecInitGroup((Group *) node,
  												 estate, eflags);
*************** ExecEndNode(PlanState *node)
*** 679,684 ****
--- 685,694 ----
  			ExecEndSort((SortState *) node);
  			break;
  
+ 		case T_IncrementalSortState:
+ 			ExecEndIncrementalSort((IncrementalSortState *) node);
+ 			break;
+ 
  		case T_GroupState:
  			ExecEndGroup((GroupState *) node);
  			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index da6ef1a..ae9edb9
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 666,671 ****
--- 666,672 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 753,759 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 754,760 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index ...1a1e48f
*** a/src/backend/executor/nodeIncrementalSort.c
--- b/src/backend/executor/nodeIncrementalSort.c
***************
*** 0 ****
--- 1,649 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.c
+  *	  Routines to handle incremental sorting of relations.
+  *
+  * DESCRIPTION
+  *
+  *		Incremental sort is a specially optimized kind of multikey sort used
+  *		when the input is already presorted by a prefix of the required keys
+  *		list.  Thus, when it's required to sort by (key1, key2 ... keyN) and
+  *		the input is already sorted by (key1, key2 ... keyM), M < N, we sort
+  *		the groups where the values of (key1, key2 ... keyM) are equal.
+  *
+  *		Consider the following example.  We have input tuples consisting of
+  *		two integers (x, y), already presorted by x, while it's required to
+  *		sort them by both x and y.  Let the input tuples be the following.
+  *
+  *		(1, 5)
+  *		(1, 2)
+  *		(2, 10)
+  *		(2, 1)
+  *		(2, 5)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		The incremental sort algorithm would individually sort by y the
+  *		following groups, which have equal x:
+  *			(1, 5) (1, 2)
+  *			(2, 10) (2, 1) (2, 5)
+  *			(3, 3) (3, 7)
+  *
+  *		After sorting these groups and putting them together, we would get
+  *		the following tuple set, which is sorted by both x and y.
+  *
+  *		(1, 2)
+  *		(1, 5)
+  *		(2, 1)
+  *		(2, 5)
+  *		(2, 10)
+  *		(3, 3)
+  *		(3, 7)
+  *
+  *		Incremental sort is faster than full sort on large datasets.  But
+  *		its biggest benefit shows in queries with LIMIT, because incremental
+  *		sort can return the first tuples without reading the whole input
+  *		dataset.
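+  *
+  *		As an illustration (a hypothetical schema, not part of the
+  *		regression tests), given an index on (x) alone, a query like
+  *
+  *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+  *
+  *		can be executed with an incremental sort on top of the index
+  *		scan: only as many x-groups as are needed to produce ten tuples
+  *		have to be fetched and sorted by y.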
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeIncrementalSort.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/htup_details.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeIncrementalSort.h"
+ #include "miscadmin.h"
+ #include "utils/lsyscache.h"
+ #include "utils/tuplesort.h"
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(IncrementalSortState *node)
+ {
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	int					skipCols,
+ 						i;
+ 
+ 	Assert(IsA(plannode, IncrementalSort));
+ 	skipCols = plannode->skipCols;
+ 
+ 	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sort.sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 										plannode->sort.sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sort.sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->sort.collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+ 															TupleTableSlot *b)
+ {
+ 	int n, i;
+ 
+ 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+ 
+ 	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = slot_getattr(a, attno, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Copying tuples into node->sampleSlot introduces some overhead.  It's
+  * especially noticeable when groups contain only one or a few tuples.  To
+  * cope with this problem, we don't copy the sample tuple until the group
+  * contains at least MIN_GROUP_SIZE tuples.  Surely, this might reduce the
+  * efficiency of incremental sort, but it also reduces the probability of
+  * regression.
+  */
+ #define MIN_GROUP_SIZE 32
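+ 
+ /*
+  * For example (illustrative numbers only): if every presorted-prefix group
+  * contains a single tuple, the first MIN_GROUP_SIZE tuples are still
+  * accumulated into a single tuplesort batch before any group boundary
+  * comparison is performed.
+  */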
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSort
+  *
+  *		Assuming that the outer subtree returns tuples presorted by some
+  *		prefix of the target sort columns, performs an incremental sort.
+  *		It fetches groups of tuples where the prefix sort columns are
+  *		equal and sorts them using tuplesort.  This approach allows us to
+  *		avoid sorting the whole dataset.  Besides taking less memory and
+  *		being faster, it allows us to start returning tuples before
+  *		fetching the full dataset from the outer subtree.
+  *
+  *		Conditions:
+  *		  -- none.
+  *
+  *		Initial States:
+  *		  -- the outer child is prepared to return the first tuple.
+  * ----------------------------------------------------------------
+  */
+ static TupleTableSlot *
+ ExecIncrementalSort(PlanState *pstate)
+ {
+ 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+ 	EState			   *estate;
+ 	ScanDirection		dir;
+ 	Tuplesortstate	   *tuplesortstate;
+ 	TupleTableSlot	   *slot;
+ 	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+ 	PlanState		   *outerNode;
+ 	TupleDesc			tupDesc;
+ 	int64				nTuples = 0;
+ 
+ 	/*
+ 	 * get state info from node
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "entering routine");
+ 
+ 	estate = node->ss.ps.state;
+ 	dir = estate->es_direction;
+ 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+ 
+ 	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  false, slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
+ 	 * If first time through or the previous group is exhausted, read the
+ 	 * next group of tuples from the outer plan and pass them to
+ 	 * tuplesort.c.  Subsequent calls just fetch tuples from tuplesort.
+ 	 */
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "sorting subplan");
+ 
+ 	/*
+ 	 * Want to scan subplan in the forward direction while creating the
+ 	 * sorted data.
+ 	 */
+ 	estate->es_direction = ForwardScanDirection;
+ 
+ 	/*
+ 	 * Initialize tuplesort module.
+ 	 */
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "calling tuplesort_begin");
+ 
+ 	outerNode = outerPlanState(node);
+ 	tupDesc = ExecGetResultType(outerNode);
+ 
+ 	if (node->tuplesortstate == NULL)
+ 	{
+ 		/*
+ 		 * We are going to process the first group of presorted data.
+ 		 * Initialize support structures for cmpSortSkipCols - already
+ 		 * sorted columns.
+ 		 */
+ 		prepareSkipCols(node);
+ 
+ 		/*
+ 		 * Pass all the columns to tuplesort.  We pass groups of at least
+ 		 * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+ 		 * necessarily have equal values of the first column.  We are
+ 		 * unlikely to have huge groups with incremental sort, so the use
+ 		 * of abbreviated keys would likely be a waste of time.
+ 		 */
+ 		tuplesortstate = tuplesort_begin_heap(
+ 									tupDesc,
+ 									plannode->sort.numCols,
+ 									plannode->sort.sortColIdx,
+ 									plannode->sort.sortOperators,
+ 									plannode->sort.collations,
+ 									plannode->sort.nullsFirst,
+ 									work_mem,
+ 									false,
+ 									true);
+ 		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
+ 	else
+ 	{
+ 		/* Next group of presorted data */
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 	}
+ 	node->groupsCount++;
+ 
+ 	/* Calculate remaining bound for bounded sort */
+ 	if (node->bounded)
+ 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+ 
+ 	/* Put the saved tuple, if any, into the tuplesort */
+ 	if (!TupIsNull(node->sampleSlot))
+ 	{
+ 		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+ 		ExecClearTuple(node->sampleSlot);
+ 		nTuples++;
+ 	}
+ 
+ 	/*
+ 	 * Feed the tuplesort with the next group of tuples, i.e. those whose
+ 	 * first skipCols sort column values are all equal.
+ 	 */
+ 	for (;;)
+ 	{
+ 		slot = ExecProcNode(outerNode);
+ 
+ 		if (TupIsNull(slot))
+ 		{
+ 			node->finished = true;
+ 			break;
+ 		}
+ 
+ 		/* Unconditionally add tuples while the group is below MIN_GROUP_SIZE */
+ 		if (nTuples < MIN_GROUP_SIZE)
+ 		{
+ 			tuplesort_puttupleslot(tuplesortstate, slot);
+ 
+ 			/* Save last tuple in minimal group */
+ 			if (nTuples == MIN_GROUP_SIZE - 1)
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 			nTuples++;
+ 		}
+ 		else
+ 		{
+ 			/* Iterate while skip cols are the same as in saved tuple */
+ 			bool	cmp;
+ 			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+ 
+ 			if (cmp)
+ 			{
+ 				tuplesort_puttupleslot(tuplesortstate, slot);
+ 				nTuples++;
+ 			}
+ 			else
+ 			{
+ 				ExecCopySlot(node->sampleSlot, slot);
+ 				break;
+ 			}
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Complete the sort.
+ 	 */
+ 	tuplesort_performsort(tuplesortstate);
+ 
+ 	/*
+ 	 * restore to user specified direction
+ 	 */
+ 	estate->es_direction = dir;
+ 
+ 	/*
+ 	 * finally set the sorted flag to true
+ 	 */
+ 	node->sort_Done = true;
+ 	node->bounded_Done = node->bounded;
+ 	if (node->shared_info && node->am_worker)
+ 	{
+ 		TuplesortInstrumentation *si;
+ 
+ 		Assert(IsParallelWorker());
+ 		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ 		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+ 		tuplesort_get_stats(tuplesortstate, si);
+ 		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+ 															node->groupsCount;
+ 	}
+ 
+ 	/*
+ 	 * Adjust bound_Done with number of tuples we've actually sorted.
+ 	 */
+ 	if (node->bounded)
+ 	{
+ 		if (node->finished)
+ 			node->bound_Done = node->bound;
+ 		else
+ 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+ 	}
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+ 
+ 	SO1_printf("ExecIncrementalSort: %s\n",
+ 			   "retrieving tuple from tuplesort");
+ 
+ 	/*
+ 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+ 	 * tuples.
+ 	 */
+ 	slot = node->ss.ps.ps_ResultTupleSlot;
+ 	(void) tuplesort_gettupleslot(tuplesortstate,
+ 								  ScanDirectionIsForward(dir),
+ 								  false, slot, NULL);
+ 	return slot;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitIncrementalSort
+  *
+  *		Creates the run-time state information for the sort node
+  *		produced by the planner and initializes its outer subtree.
+  * ----------------------------------------------------------------
+  */
+ IncrementalSortState *
+ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+ {
+ 	IncrementalSortState   *incrsortstate;
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "initializing sort node");
+ 
+ 	/*
+ 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+ 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+ 	 * current group in tuplesortstate.
+ 	 */
+ 	Assert((eflags & (EXEC_FLAG_REWIND |
+ 					  EXEC_FLAG_BACKWARD |
+ 					  EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	incrsortstate = makeNode(IncrementalSortState);
+ 	incrsortstate->ss.ps.plan = (Plan *) node;
+ 	incrsortstate->ss.ps.state = estate;
+ 	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+ 
+ 	incrsortstate->bounded = false;
+ 	incrsortstate->sort_Done = false;
+ 	incrsortstate->finished = false;
+ 	incrsortstate->tuplesortstate = NULL;
+ 	incrsortstate->sampleSlot = NULL;
+ 	incrsortstate->bound_Done = 0;
+ 	incrsortstate->groupsCount = 0;
+ 	incrsortstate->skipKeys = NULL;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * Sort nodes don't initialize their ExprContexts because they never call
+ 	 * ExecQual or ExecProject.
+ 	 */
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 *
+ 	 * sort nodes only return scan tuples from their sorted relation.
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+ 
+ 	/*
+ 	 * initialize child nodes
+ 	 *
+ 	 * We shield the child node from the need to support REWIND, BACKWARD, or
+ 	 * MARK/RESTORE.
+ 	 */
+ 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+ 
+ 	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+ 
+ 	/*
+ 	 * initialize tuple type.  no need to initialize projection info because
+ 	 * this node doesn't do projections.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+ 	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+ 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+ 
+ 	/* make standalone slot to store previous tuple from outer node */
+ 	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+ 							ExecGetResultType(outerPlanState(incrsortstate)));
+ 
+ 	SO1_printf("ExecInitIncrementalSort: %s\n",
+ 			   "sort node initialized");
+ 
+ 	return incrsortstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecEndIncrementalSort(node)
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecEndIncrementalSort(IncrementalSortState *node)
+ {
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "shutting down sort node");
+ 
+ 	/*
+ 	 * clean out the tuple table
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	/* must drop standalone tuple slot from outer node */
+ 	ExecDropSingleTupleTableSlot(node->sampleSlot);
+ 
+ 	/*
+ 	 * Release tuplesort resources
+ 	 */
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 
+ 	/*
+ 	 * shut down the subplan
+ 	 */
+ 	ExecEndNode(outerPlanState(node));
+ 
+ 	SO1_printf("ExecEndIncrementalSort: %s\n",
+ 			   "sort node shutdown");
+ }
+ 
+ void
+ ExecReScanIncrementalSort(IncrementalSortState *node)
+ {
+ 	PlanState  *outerPlan = outerPlanState(node);
+ 
+ 	/*
+ 	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+ 	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+ 	 * re-scan it at all.
+ 	 */
+ 	if (!node->sort_Done)
+ 		return;
+ 
+ 	/* must drop pointer to sort result tuple */
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 
+ 	/*
+ 	 * Incremental sort holds only the current group in the tuplesort, so
+ 	 * there is no complete sorted output to rewind to.  We always forget
+ 	 * the previous sort results and re-read the subplan to re-sort.
+ 	 */
+ 	node->sort_Done = false;
+ 	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+ 	node->tuplesortstate = NULL;
+ 	node->bound_Done = 0;
+ 
+ 	/*
+ 	 * if chgParam of subnode is not null then plan will be re-scanned by
+ 	 * first ExecProcNode.
+ 	 */
+ 	if (outerPlan->chgParam == NULL)
+ 		ExecReScan(outerPlan);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *						Parallel Query Support
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortEstimate
+  *
+  *		Estimate space required to propagate sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+ 	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+ 	shm_toc_estimate_chunk(&pcxt->estimator, size);
+ 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeDSM
+  *
+  *		Initialize DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	Size		size;
+ 
+ 	/* don't need this if not instrumenting or no workers */
+ 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+ 	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+ 	/* ensure any unfilled slots will contain zeroes */
+ 	memset(node->shared_info, 0, size);
+ 	node->shared_info->num_workers = pcxt->nworkers;
+ 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+ 				   node->shared_info);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortReInitializeDSM
+  *
+  *		Reset shared state before beginning a fresh scan.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+ {
+ 	/* If there's any instrumentation space, clear it for next time */
+ 	if (node->shared_info != NULL)
+ 	{
+ 		memset(node->shared_info->sinfo, 0,
+ 			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+ 	}
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortInitializeWorker
+  *
+  *		Attach worker to DSM space for sort statistics.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+ {
+ 	node->shared_info =
+ 		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+ 	node->am_worker = true;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecIncrementalSortRetrieveInstrumentation
+  *
+  *		Transfer sort statistics from DSM to private memory.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+ {
+ 	Size		size;
+ 	SharedIncrementalSortInfo *si;
+ 
+ 	if (node->shared_info == NULL)
+ 		return;
+ 
+ 	size = offsetof(SharedIncrementalSortInfo, sinfo)
+ 		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+ 	si = palloc(size);
+ 	memcpy(si, node->shared_info, size);
+ 	node->shared_info = si;
+ }
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 73aa371..ef3587c
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(PlanState *pstate)
*** 93,99 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
--- 93,100 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index aff9a62..56a5651
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyMaterial(const Material *from)
*** 919,924 ****
--- 919,942 ----
  
  
  /*
+  * CopySortFields
+  *
+  *		This function copies the fields of the Sort node.  It is used by
+  *		all the copy functions for classes which inherit from Sort.
+  */
+ static void
+ CopySortFields(const Sort *from, Sort *newnode)
+ {
+ 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+ 
+ 	COPY_SCALAR_FIELD(numCols);
+ 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+ 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+ 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+ }
+ 
+ /*
   * _copySort
   */
  static Sort *
*************** _copySort(const Sort *from)
*** 929,941 ****
  	/*
  	 * copy node superclass fields
  	 */
! 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
! 	COPY_SCALAR_FIELD(numCols);
! 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
! 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
! 	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
  
  	return newnode;
  }
--- 947,975 ----
  	/*
  	 * copy node superclass fields
  	 */
! 	CopySortFields(from, newnode);
  
! 	return newnode;
! }
! 
! 
! /*
!  * _copyIncrementalSort
!  */
! static IncrementalSort *
! _copyIncrementalSort(const IncrementalSort *from)
! {
! 	IncrementalSort	   *newnode = makeNode(IncrementalSort);
! 
! 	/*
! 	 * copy node superclass fields
! 	 */
! 	CopySortFields((const Sort *) from, (Sort *) newnode);
! 
! 	/*
! 	 * copy remainder of node
! 	 */
! 	COPY_SCALAR_FIELD(skipCols);
  
  	return newnode;
  }
*************** copyObjectImpl(const void *from)
*** 4815,4820 ****
--- 4849,4857 ----
  		case T_Sort:
  			retval = _copySort(from);
  			break;
+ 		case T_IncrementalSort:
+ 			retval = _copyIncrementalSort(from);
+ 			break;
  		case T_Group:
  			retval = _copyGroup(from);
  			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index c97ee24..6cb9300
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outMaterial(StringInfo str, const Mater
*** 869,880 ****
  }
  
  static void
! _outSort(StringInfo str, const Sort *node)
  {
  	int			i;
  
- 	WRITE_NODE_TYPE("SORT");
- 
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
--- 869,878 ----
  }
  
  static void
! _outSortInfo(StringInfo str, const Sort *node)
  {
  	int			i;
  
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
*************** _outSort(StringInfo str, const Sort *nod
*** 897,902 ****
--- 895,918 ----
  }
  
  static void
+ _outSort(StringInfo str, const Sort *node)
+ {
+ 	WRITE_NODE_TYPE("SORT");
+ 
+ 	_outSortInfo(str, node);
+ }
+ 
+ static void
+ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
+ {
+ 	WRITE_NODE_TYPE("INCREMENTALSORT");
+ 
+ 	_outSortInfo(str, (const Sort *) node);
+ 
+ 	WRITE_INT_FIELD(skipCols);
+ }
+ 
+ static void
  _outUnique(StringInfo str, const Unique *node)
  {
  	int			i;
*************** outNode(StringInfo str, const void *obj)
*** 3737,3742 ****
--- 3753,3761 ----
  			case T_Sort:
  				_outSort(str, obj);
  				break;
+ 			case T_IncrementalSort:
+ 				_outIncrementalSort(str, obj);
+ 				break;
  			case T_Unique:
  				_outUnique(str, obj);
  				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 7eb67fc..f2b0e75
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readMaterial(void)
*** 2059,2070 ****
  }
  
  /*
!  * _readSort
   */
! static Sort *
! _readSort(void)
  {
! 	READ_LOCALS(Sort);
  
  	ReadCommonPlan(&local_node->plan);
  
--- 2059,2071 ----
  }
  
  /*
!  * ReadCommonSort
!  *	Assign the basic stuff of all nodes that inherit from Sort
   */
! static void
! ReadCommonSort(Sort *local_node)
  {
! 	READ_TEMP_LOCALS();
  
  	ReadCommonPlan(&local_node->plan);
  
*************** _readSort(void)
*** 2073,2078 ****
--- 2074,2105 ----
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
  	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+ }
+ 
+ /*
+  * _readSort
+  */
+ static Sort *
+ _readSort(void)
+ {
+ 	READ_LOCALS_NO_FIELDS(Sort);
+ 
+ 	ReadCommonSort(local_node);
+ 
+ 	READ_DONE();
+ }
+ 
+ /*
+  * _readIncrementalSort
+  */
+ static IncrementalSort *
+ _readIncrementalSort(void)
+ {
+ 	READ_LOCALS(IncrementalSort);
+ 
+ 	ReadCommonSort(&local_node->sort);
+ 
+ 	READ_INT_FIELD(skipCols);
  
  	READ_DONE();
  }
*************** parseNodeString(void)
*** 2634,2639 ****
--- 2661,2668 ----
  		return_value = _readMaterial();
  	else if (MATCH("SORT", 4))
  		return_value = _readSort();
+ 	else if (MATCH("INCREMENTALSORT", 7))
+ 		return_value = _readIncrementalSort();
  	else if (MATCH("GROUP", 5))
  		return_value = _readGroup();
  	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
new file mode 100644
index 44f6b03..fbfef2b
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*************** print_path(PlannerInfo *root, Path *path
*** 3461,3466 ****
--- 3461,3470 ----
  			ptype = "Sort";
  			subpath = ((SortPath *) path)->subpath;
  			break;
+ 		case T_IncrementalSortPath:
+ 			ptype = "IncrementalSort";
+ 			subpath = ((SortPath *) path)->subpath;
+ 			break;
  		case T_GroupPath:
  			ptype = "Group";
  			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index d11bf19..2f7cf60
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** bool		enable_indexonlyscan = true;
*** 121,126 ****
--- 121,127 ----
  bool		enable_bitmapscan = true;
  bool		enable_tidscan = true;
  bool		enable_sort = true;
+ bool		enable_incrementalsort = true;
  bool		enable_hashagg = true;
  bool		enable_nestloop = true;
  bool		enable_material = true;
*************** cost_recursive_union(Path *runion, Path 
*** 1601,1606 ****
--- 1602,1614 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or an incremental sort
+  * when we already have data presorted by some prefix of the required
+  * pathkeys.  In the latter case we estimate the number of groups the source
+  * data is divided into by the presorted pathkeys, and then estimate the cost
+  * of sorting each individual group, assuming the data is divided into groups
+  * uniformly.  Also, if a LIMIT is specified, then we have to pull from the
+  * source and sort only some of the groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1627,1633 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1635,1643 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'presorted_keys' is the number of pathkeys already presorted in the given path
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1643,1661 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
  
  	path->rows = tuples;
  
--- 1653,1680 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
  		startup_cost += disable_cost;
+ 	if (!enable_incrementalsort)
+ 		presorted_keys = 0;
  
  	path->rows = tuples;
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1681,1693 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1700,1749 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by the
! 	 * presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 										linitial(key->pk_eclass->ec_members);
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group where the presorted
! 	 * keys are equal.
! 	 */
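! 	/*
! 	 * For example (illustrative numbers): sorting 1M tuples split into 1000
! 	 * uniform groups costs about 1000 * (1000 * LOG2(1000)) ~= 1.0e7
! 	 * comparisons, versus 1M * LOG2(1M) ~= 2.0e7 for a full sort.
! 	 */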
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1697,1703 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1753,1759 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1708,1717 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1764,1773 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1719,1732 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1775,1807 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
! 		/*
! 		 * We'll use plain quicksort on all the input tuples.  If we expect
! 		 * less than two tuples per sort group, then assume the logarithmic
! 		 * part of the estimate to be 1.
! 		 */
! 		if (group_tuples >= 2.0)
! 			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
! 		else
! 			group_cost = comparison_cost * group_tuples;
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can start returning
+ 	 * tuples.  Sorting the rest of the groups is required to return all
+ 	 * the other tuples.
+ 	 */
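+ 	/*
+ 	 * For example (illustrative numbers): with 100 uniform groups and a
+ 	 * LIMIT fetching 10% of the tuples, num_groups * (output_tuples /
+ 	 * tuples) is 10, so one group's cost is charged to startup and the
+ 	 * remaining nine groups' cost to run cost.
+ 	 */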
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1737,1742 ****
--- 1812,1830 ----
  	 */
  	run_cost += cpu_operator_cost * tuples;
  
+ 	/* Extra costs of incremental sort */
+ 	if (presorted_keys > 0)
+ 	{
+ 		/*
+ 		 * In the incremental sort case we also have to account for the cost
+ 		 * of detecting sort groups.  This turns out to be one extra copy and
+ 		 * comparison per tuple.
+ 		 */
+ 		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+ 
+ 		/* Cost of the per-group tuplesort reset */
+ 		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+ 	}
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2544,2549 ****
--- 2632,2639 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2570,2575 ****
--- 2660,2667 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6870d3..b97f22a
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 22,31 ****
--- 22,33 ----
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
  #include "optimizer/clauses.h"
+ #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 308,313 ****
--- 310,342 ----
  	return PATHKEYS_EQUAL;
  }
  
+ 
+ /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int			n = 0;
+ 	ListCell   *key1,
+ 			   *key2;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
  /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
*************** right_merge_direction(PlannerInfo *root,
*** 1488,1513 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
! static int
! pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
! 	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
- 
- 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1517,1558 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of pathkeys that match the given query_pathkeys.  The
!  * remaining keys can be satisfied by an incremental sort.
   */
! int
! pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
  {
! 	int	n_common_pathkeys;
! 
! 	if (query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
! 
! 	if (enable_incrementalsort)
  	{
! 		/*
! 		 * Return the number of pathkeys in common, or 0 if there are none.
! 		 * Any common prefix of pathkeys is useful for ordering because we
! 		 * can use incremental sort for the remaining keys.
! 		 */
! 		return n_common_pathkeys;
! 	}
! 	else
! 	{
! 		/*
! 		 * When incremental sort is disabled, pathkeys are useful only when
! 		 * they contain all the query pathkeys.
! 		 */
! 		if (n_common_pathkeys == list_length(query_pathkeys))
! 			return n_common_pathkeys;
! 		else
! 			return 0;
  	}
  }
  
  /*
*************** truncate_useless_pathkeys(PlannerInfo *r
*** 1523,1529 ****
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
--- 1568,1574 ----
  	int			nuseful2;
  
  	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
! 	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
  	if (nuseful2 > nuseful)
  		nuseful = nuseful2;
  
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index d445477..b080fa6
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 235,241 ****
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 235,241 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype, bool inner_unique,
  			   bool skip_mark_restore);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static EquivalenceMember *find_ec_member
*** 251,260 ****
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 251,261 ----
  					   TargetEntry *tle,
  					   Relids relids);
  static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_plan_recurse(PlannerInfo *root, P
*** 436,441 ****
--- 437,443 ----
  											   (GatherPath *) best_path);
  			break;
  		case T_Sort:
+ 		case T_IncrementalSort:
  			plan = (Plan *) create_sort_plan(root,
  											 (SortPath *) best_path,
  											 flags);
*************** create_merge_append_plan(PlannerInfo *ro
*** 1120,1125 ****
--- 1122,1128 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1154,1162 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1157,1167 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1506,1511 ****
--- 1511,1517 ----
  	Plan	   *subplan;
  	List	   *pathkeys = best_path->path.pathkeys;
  	List	   *tlist = build_path_tlist(root, &best_path->path);
+ 	int			n_common_pathkeys;
  
  	/* As with Gather, it's best to project away columns in the workers. */
  	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
*************** create_gather_merge_plan(PlannerInfo *ro
*** 1535,1546 ****
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
--- 1541,1556 ----
  
  
  	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
! 	if (n_common_pathkeys < list_length(pathkeys))
! 	{
  		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+ 									 n_common_pathkeys,
  									 gm_plan->sortColIdx,
  									 gm_plan->sortOperators,
  									 gm_plan->collations,
  									 gm_plan->nullsFirst);
+ 	}
  
  	/* Now insert the subplan under GatherMerge. */
  	gm_plan->plan.lefttree = subplan;
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1653,1658 ****
--- 1663,1669 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1662,1668 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1673,1685 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	if (IsA(best_path, IncrementalSortPath))
! 		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
! 	else
! 		n_common_pathkeys = 0;
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   NULL, n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1906,1912 ****
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan);
  			}
  
  			if (!rollup->is_hashed)
--- 1923,1930 ----
  				sort_plan = (Plan *)
  					make_sort_from_groupcols(rollup->groupClause,
  											 new_grpColIdx,
! 											 subplan,
! 											 0);
  			}
  
  			if (!rollup->is_hashed)
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3846,3855 ****
  	 */
  	if (best_path->outersortkeys)
  	{
  		Relids		outer_relids = outer_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys,
! 												   outer_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3864,3878 ----
  	 */
  	if (best_path->outersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		outer_relids = outer_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   outer_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3860,3869 ****
  
  	if (best_path->innersortkeys)
  	{
  		Relids		inner_relids = inner_path->parent->relids;
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys,
! 												   inner_relids);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3883,3897 ----
  
  	if (best_path->innersortkeys)
  	{
+ 		Sort	   *sort;
+ 		int			n_common_pathkeys;
  		Relids		inner_relids = inner_path->parent->relids;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   inner_relids, n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4914,4921 ****
  {
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4942,4954 ----
  {
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
+ 	int			skip_cols = 0;
  
! 	if (IsA(plan, IncrementalSort))
! 		skip_cols = ((IncrementalSort *) plan)->skipCols;
! 
! 	cost_sort(&sort_path, root, NIL, skip_cols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5504,5516 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node = makeNode(Sort);
! 	Plan	   *plan = &node->plan;
  
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
--- 5537,5567 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
! 	Sort	   *node;
! 	Plan	   *plan;
! 
! 	/* Always use regular sort node when enable_incrementalsort = false */
! 	if (!enable_incrementalsort)
! 		skipCols = 0;
  
+ 	if (skipCols == 0)
+ 	{
+ 		node = makeNode(Sort);
+ 	}
+ 	else
+ 	{
+ 		IncrementalSort    *incrementalSort;
+ 
+ 		incrementalSort = makeNode(IncrementalSort);
+ 		node = &incrementalSort->sort;
+ 		incrementalSort->skipCols = skipCols;
+ 	}
+ 
+ 	plan = &node->plan;
  	plan->targetlist = lefttree->targetlist;
  	plan->qual = NIL;
  	plan->lefttree = lefttree;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5843,5851 ****
   *	  'lefttree' is the node which yields input tuples
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5894,5904 ----
   *	  'lefttree' is the node which yields input tuples
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+  *	  'skipCols' is the number of presorted columns in input tuples
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						Relids relids, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5865,5871 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5918,5924 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5908,5914 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5961,5967 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5929,5935 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5982,5989 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5962,5968 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 6016,6022 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** is_projection_capable_plan(Plan *plan)
*** 6619,6624 ****
--- 6673,6679 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 889e8af..49af1f1
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index ef2eaea..5b41aaf
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3846,3859 ****
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				bool		is_sorted;
  
! 				is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 												  path->pathkeys);
! 				if (path == cheapest_partial_path || is_sorted)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (!is_sorted)
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
--- 3846,3859 ----
  			foreach(lc, input_rel->partial_pathlist)
  			{
  				Path	   *path = (Path *) lfirst(lc);
! 				int			n_useful_pathkeys;
  
! 				n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
  				{
  					/* Sort the cheapest partial path, if it isn't already */
! 					if (n_useful_pathkeys < list_length(root->group_pathkeys))
  						path = (Path *) create_sort_path(root,
  														 grouped_rel,
  														 path,
*************** create_grouping_paths(PlannerInfo *root,
*** 3926,3939 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3926,3939 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_useful_pathkeys;
  
! 			n_useful_pathkeys = pathkeys_useful_for_ordering(
! 										root->group_pathkeys, path->pathkeys);
! 			if (path == cheapest_path || n_useful_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 5000,5012 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 5000,5012 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_useful_pathkeys;
  
! 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
! 														 path->pathkeys);
! 		if (path == cheapest_input_path || n_useful_pathkeys > 0)
  		{
! 			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 6136,6143 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 6136,6144 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
new file mode 100644
index b5c4124..1ff9d42
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
*************** set_plan_refs(PlannerInfo *root, Plan *p
*** 642,647 ****
--- 642,648 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 2e3abee..0ee6812
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** finalize_plan(PlannerInfo *root, Plan *p
*** 2782,2787 ****
--- 2782,2788 ----
  		case T_Hash:
  		case T_Material:
  		case T_Sort:
+ 		case T_IncrementalSort:
  		case T_Unique:
  		case T_SetOp:
  		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index f620243..c83161f
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 988,994 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 988,995 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index bc0841b..d973f8b
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 103,109 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 103,109 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** create_merge_append_path(PlannerInfo *ro
*** 1304,1315 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1304,1316 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1323,1328 ****
--- 1324,1331 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1570,1576 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1573,1580 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_gather_merge_path(PlannerInfo *ro
*** 1663,1668 ****
--- 1667,1673 ----
  	GatherMergePath *pathnode = makeNode(GatherMergePath);
  	Cost		input_startup_cost = 0;
  	Cost		input_total_cost = 0;
+ 	int			n_common_pathkeys;
  
  	Assert(subpath->parallel_safe);
  	Assert(pathkeys);
*************** create_gather_merge_path(PlannerInfo *ro
*** 1679,1685 ****
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
--- 1684,1692 ----
  	pathnode->path.pathtarget = target ? target : rel->reltarget;
  	pathnode->path.rows += subpath->rows;
  
! 	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 
! 	if (n_common_pathkeys == list_length(pathkeys))
  	{
  		/* Subpath is adequately ordered, we won't need to sort it */
  		input_startup_cost += subpath->startup_cost;
*************** create_gather_merge_path(PlannerInfo *ro
*** 1693,1698 ****
--- 1700,1707 ----
  		cost_sort(&sort_path,
  				  root,
  				  pathkeys,
+ 				  n_common_pathkeys,
+ 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  subpath->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2549,2557 ****
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode = makeNode(SortPath);
  
- 	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
--- 2558,2588 ----
  				 List *pathkeys,
  				 double limit_tuples)
  {
! 	SortPath   *pathnode;
! 	int			n_common_pathkeys;
! 
! 	if (enable_incrementalsort)
! 		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
! 	else
! 		n_common_pathkeys = 0;
! 
! 	if (n_common_pathkeys == 0)
! 	{
! 		pathnode = makeNode(SortPath);
! 		pathnode->path.pathtype = T_Sort;
! 	}
! 	else
! 	{
! 		IncrementalSortPath   *incpathnode;
! 
! 		incpathnode = makeNode(IncrementalSortPath);
! 		pathnode = &incpathnode->spath;
! 		pathnode->path.pathtype = T_IncrementalSort;
! 		incpathnode->skipCols = n_common_pathkeys;
! 	}
! 
! 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.parent = rel;
  	/* Sort doesn't project, so use source path's pathtarget */
  	pathnode->path.pathtarget = subpath->pathtarget;
*************** create_sort_path(PlannerInfo *root,
*** 2565,2571 ****
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2596,2604 ----
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2877,2883 ****
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
--- 2910,2917 ----
  			else
  			{
  				/* Account for cost of sort, but don't charge input cost again */
! 				cost_sort(&sort_path, root, NIL, 0,
! 						  0.0,
  						  0.0,
  						  subpath->rows,
  						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index 1e323d9..8f01f05
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 291,297 ****
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 291,298 ----
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
  												   work_mem,
! 												   qstate->rescan_needed,
! 												   false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index ea95b80..abf6c38
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3715,3720 ****
--- 3715,3756 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups the pathkeys
+  * 							  divide the dataset into.
+  *
+  * Returns an array of group counts: the i'th element is the number of groups
+  * the first i pathkeys divide the dataset into.  This is actually a
+  * convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucket statistics when the specified expression is used
   * as a hash key for the given number of buckets.
   *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 6dcd738..192d3c8
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** static struct config_bool ConfigureNames
*** 858,863 ****
--- 858,872 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("Enables the planner's use of incremental sort steps."),
+ 			NULL
+ 		},
+ 		&enable_incrementalsort,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of hashed aggregation plans."),
  			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 3c23ac7..118edb9
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 231,236 ****
--- 231,243 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	int64		maxSpace;		/* maximum amount of space occupied among sorts
+ 								   of groups, either in-memory or on-disk */
+ 	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+ 								   space, false when it's the value for
+ 								   in-memory space */
+ 	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 573,578 ****
--- 580,588 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 607,625 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 617,646 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 636,642 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 657,663 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 654,659 ****
--- 675,681 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 694,706 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 716,729 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 742,748 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 765,771 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 773,779 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 796,802 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 864,870 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 887,893 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 939,945 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 962,968 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 981,987 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1004,1010 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1092,1107 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1115,1126 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1160,1166 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1179,1276 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	spaceUsed;
! 	bool	spaceUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		spaceUsedOnDisk = true;
! 		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		spaceUsedOnDisk = false;
! 		spaceUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	if (spaceUsed > state->maxSpace)
! 	{
! 		state->maxSpace = spaceUsed;
! 		state->maxSpaceOnDisk = spaceUsedOnDisk;
! 		state->maxSpaceStatus = state->status;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
!  *	start a new sort.  This allows us to avoid re-creating the tuplesort
!  *	(and thus save resources) when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->availMem = state->allowedMem;
! 	state->lastReturnedTuple = NULL;
! 	state->slabAllocatorUsed = false;
! 	state->slabMemoryBegin = NULL;
! 	state->slabMemoryEnd = NULL;
! 	state->slabFreeHead = NULL;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 2949,2966 ****
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	if (state->tapeset)
! 	{
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
- 		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3059,3073 ----
  	 * to fix.  Is it worth creating an API for the memory context code to
  	 * tell us how much is actually used in sortcontext?
  	 */
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxSpaceOnDisk)
  		stats->spaceType = SORT_SPACE_TYPE_DISK;
  	else
  		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
! 	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
  
! 	switch (state->maxSpaceStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index ...b2e4e50
*** a/src/include/executor/nodeIncrementalSort.h
--- b/src/include/executor/nodeIncrementalSort.h
***************
*** 0 ****
--- 1,31 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeIncrementalSort.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeIncrementalSort.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODEINCREMENTALSORT_H
+ #define NODEINCREMENTALSORT_H
+ 
+ #include "access/parallel.h"
+ #include "nodes/execnodes.h"
+ 
+ extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+ extern void ExecEndIncrementalSort(IncrementalSortState *node);
+ extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+ 
+ /* parallel instrumentation support */
+ extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+ extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+ extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+ 
+ #endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e05bc04..ff019c5
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1743,1748 ****
--- 1743,1762 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of those keys.  We call these "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 Shared memory container for per-worker sort information
   * ----------------
*************** typedef struct SortState
*** 1771,1776 ****
--- 1785,1828 ----
  	SharedSortInfo *shared_info;	/* one entry per worker */
  } SortState;
  
+ /* ----------------
+  *	 Shared memory container for per-worker incremental sort information
+  * ----------------
+  */
+ typedef struct IncrementalSortInfo
+ {
+ 	TuplesortInstrumentation	sinstrument;
+ 	int64						groupsCount;
+ } IncrementalSortInfo;
+ 
+ typedef struct SharedIncrementalSortInfo
+ {
+ 	int							num_workers;
+ 	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+ } SharedIncrementalSortInfo;
+ 
+ /* ----------------
+  *	 IncrementalSortState information
+  * ----------------
+  */
+ typedef struct IncrementalSortState
+ {
+ 	ScanState	ss;				/* its first field is NodeTag */
+ 	bool		bounded;		/* is the result set bounded? */
+ 	int64		bound;			/* if bounded, how many tuples are needed */
+ 	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* have we finished fetching tuples
+ 								   from the outer node? */
+ 	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	int64		bound_Done;		/* value of bound we did the sort with */
+ 	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	int64		groupsCount;	/* number of groups with equal skip keys */
+ 	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+ 	bool		am_worker;		/* are we a worker? */
+ 	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+ } IncrementalSortState;
+ 
  /* ---------------------
   *	GroupState information
   * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
new file mode 100644
index c5b5115..9ae5d57
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
*************** typedef enum NodeTag
*** 73,78 ****
--- 73,79 ----
  	T_HashJoin,
  	T_Material,
  	T_Sort,
+ 	T_IncrementalSort,
  	T_Group,
  	T_Agg,
  	T_WindowAgg,
*************** typedef enum NodeTag
*** 125,130 ****
--- 126,132 ----
  	T_HashJoinState,
  	T_MaterialState,
  	T_SortState,
+ 	T_IncrementalSortState,
  	T_GroupState,
  	T_AggState,
  	T_WindowAggState,
*************** typedef enum NodeTag
*** 240,245 ****
--- 242,248 ----
  	T_ProjectionPath,
  	T_ProjectSetPath,
  	T_SortPath,
+ 	T_IncrementalSortPath,
  	T_GroupPath,
  	T_UpperUniquePath,
  	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 9b38d44..0694fb2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 749,754 ****
--- 749,765 ----
  	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
  } Sort;
  
+ 
+ /* ----------------
+  *		incremental sort node
+  * ----------------
+  */
+ typedef struct IncrementalSort
+ {
+ 	Sort		sort;
+ 	int			skipCols;		/* number of presorted columns */
+ } IncrementalSort;
+ 
  /* ---------------
   *	 group node -
   *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 51df8e9..a979461
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1507,1512 ****
--- 1507,1522 ----
  } SortPath;
  
  /*
+  * IncrementalSortPath
+  */
+ typedef struct IncrementalSortPath
+ {
+ 	SortPath	spath;
+ 	int			skipCols;
+ } IncrementalSortPath;
+ 
+ 
+ /*
   * GroupPath represents grouping (of presorted input)
   *
   * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 6c2317d..138d951
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern bool enable_indexonlyscan;
*** 61,66 ****
--- 61,67 ----
  extern bool enable_bitmapscan;
  extern bool enable_tidscan;
  extern bool enable_sort;
+ extern bool enable_incrementalsort;
  extern bool enable_hashagg;
  extern bool enable_nestloop;
  extern bool enable_material;
*************** extern void cost_namedtuplestorescan(Pat
*** 103,110 ****
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 104,112 ----
  						 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index ea886b6..b4370e2
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 188,193 ****
--- 188,194 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion,
*************** extern List *select_outer_pathkeys_for_m
*** 226,231 ****
--- 227,233 ----
  extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
  							  List *mergeclauses,
  							  List *outer_pathkeys);
+ extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
  extern List *truncate_useless_pathkeys(PlannerInfo *root,
  						  RelOptInfo *rel,
  						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 199a631..41b7196
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 206,211 ****
--- 206,214 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern void estimate_hash_bucket_stats(PlannerInfo *root,
  						   Node *hashkey, double nbuckets,
  						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index b6b8c8e..938d329
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 90,96 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 90,97 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 134,139 ****
--- 135,142 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					TuplesortInstrumentation *stats);
  extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
new file mode 100644
index 75dff56..e11fb61
*** a/src/test/isolation/expected/drop-index-concurrently-1.out
--- b/src/test/isolation/expected/drop-index-concurrently-1.out
*************** Sort           
*** 19,27 ****
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Sort           
    Sort Key: id, data
!   ->  Seq Scan on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
--- 19,28 ----
  step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
  QUERY PLAN     
  
! Incremental Sort
    Sort Key: id, data
!   Presorted Key: id
!   ->  Index Scan using test_dc_pkey on test_dc
          Filter: ((data)::text = '34'::text)
  step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
  id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index fac7b62..18dc749
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** NOTICE:  drop cascades to table matest1
*** 1515,1520 ****
--- 1515,1521 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
  SELECT thousand, tenthous FROM tenk1
*************** FROM generate_series(1, 3) g(i);
*** 1655,1663 ****
--- 1656,1700 ----
   {3,7,8,10,13,13,16,18,19,22}
  (3 rows)
  
+ set enable_incrementalsort = on;
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+                                QUERY PLAN                                
+ -------------------------------------------------------------------------
+  Merge Append
+    Sort Key: tenk1.thousand, tenk1.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+    ->  Incremental Sort
+          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
+          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+ (7 rows)
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+                          QUERY PLAN                          
+ -------------------------------------------------------------
+  Merge Append
+    Sort Key: a.thousand, a.tenthous
+    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+    ->  Incremental Sort
+          Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
+          ->  Index Only Scan using tenk1_unique2 on tenk1 b
+ (7 rows)
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  --
  -- Check that constraint exclusion works correctly with partitions using
  -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
new file mode 100644
index cd1f7f3..5acfbbb
*** a/src/test/regress/expected/sysviews.out
--- b/src/test/regress/expected/sysviews.out
*************** select name, setting from pg_settings wh
*** 76,81 ****
--- 76,82 ----
   enable_gathermerge         | on
   enable_hashagg             | on
   enable_hashjoin            | on
+  enable_incrementalsort     | on
   enable_indexonlyscan       | on
   enable_indexscan           | on
   enable_material            | on
*************** select name, setting from pg_settings wh
*** 85,91 ****
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (13 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
--- 86,92 ----
   enable_seqscan             | on
   enable_sort                | on
   enable_tidscan             | on
! (14 rows)
  
  -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
  -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
new file mode 100644
index c71febf..5e077a9
*** a/src/test/regress/sql/inherit.sql
--- b/src/test/regress/sql/inherit.sql
*************** drop table matest0 cascade;
*** 544,549 ****
--- 544,550 ----
  set enable_seqscan = off;
  set enable_indexscan = on;
  set enable_bitmapscan = off;
+ set enable_incrementalsort = off;
  
  -- Check handling of duplicated, constant, or volatile targetlist items
  explain (costs off)
*************** SELECT
*** 605,613 ****
--- 606,631 ----
      ORDER BY f.i LIMIT 10)
  FROM generate_series(1, 3) g(i);
  
+ set enable_incrementalsort = on;
+ 
+ -- check incremental sort is used when enabled
+ explain (costs off)
+ SELECT thousand, tenthous FROM tenk1
+ UNION ALL
+ SELECT thousand, thousand FROM tenk1
+ ORDER BY thousand, tenthous;
+ 
+ explain (costs off)
+ SELECT x, y FROM
+   (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+    UNION ALL
+    SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ ORDER BY x, y;
+ 
  reset enable_seqscan;
  reset enable_indexscan;
  reset enable_bitmapscan;
+ reset enable_incrementalsort;
  
  --
  -- Check that constraint exclusion works correctly with partitions using
#44Antonin Houska
ah@cybertec.at
In reply to: Alexander Korotkov (#43)
Re: [HACKERS] [PATCH] Incremental sort

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Wed, Nov 22, 2017 at 1:22 PM, Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Antonin Houska <ah@cybertec.at> wrote:

* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

node->groupsCount++;

I suggest moving it outside the "if" construct.

Done.

One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero even if the input
set turns out to be empty (not sure if that can happen in practice).

That happens in practice. On an empty input set, incremental sort counts exactly one group.

# create table t (x int, y int);
CREATE TABLE
# create index t_x_idx on t (x);
CREATE INDEX
# set enable_seqscan = off;
SET
# explain (analyze, buffers) select * from t order by x, y;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=0.74..161.14 rows=2260 width=8) (actual time=0.024..0.024 rows=0 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
Sort Groups: 1
Buffers: shared hit=1
-> Index Scan using t_x_idx on t (cost=0.15..78.06 rows=2260 width=8) (actual time=0.011..0.011 rows=0 loops=1)
Buffers: shared hit=1
Planning time: 0.088 ms
Execution time: 0.066 ms
(10 rows)
But from the perspective of how the code works, it's really one group. The tuple sort was set up, no tuples were inserted into it, then it was sorted and no tuples came out. So, I'm not sure it's really incorrect...
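
For illustration, here is a minimal sketch, simplified from the patch
(variable names are illustrative and arguments abridged), of why an empty
input still reports "Sort Groups: 1":

    /*
     * The sorter is set up, and groupsCount bumped, before we know whether
     * the outer node will return any tuples at all.  On an empty input we
     * therefore still sort exactly one (empty) batch.
     */
    node->tuplesortstate = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                                sortOperators, collations,
                                                nullsFirst, work_mem,
                                                false, false);
    node->groupsCount++;            /* a group is started, possibly empty */

    while (!TupIsNull(slot = ExecProcNode(outerNode)))
        tuplesort_puttupleslot(node->tuplesortstate, slot);

    tuplesort_performsort(node->tuplesortstate);    /* sorts zero tuples */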

I expected the number of groups that actually appear in the output; you
consider it the number of groups started. I can't find a similar case
elsewhere in the code (e.g. the Agg node does not report this kind of
information), so I have no clue. Someone else will have to decide.

But there is an IncrementalSort node on the remote side.
Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test is that a cross join with ORDER BY and LIMIT is not beneficial to push down, because the LIMIT is not pushed down and the remote side wouldn't be able to use a top-N heapsort. But if the remote side has incremental sort, then it can be
used, and fetching the first 110 rows is cheap. Let's look at the plan of the original "CROSS JOIN, not pushed down" test with incremental sort.

# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

ok, understood, thanks. Perhaps it's worth a comment in the test script.

I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN be pushed down and
does introduce an IncrementalSort in the remote database. This query is
replaced with one that does not allow the join to be pushed down.

*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

Shouldn't the test contain *both* cases?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#45Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Thomas Munro (#40)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Nov 22, 2017 at 12:01 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:

I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?

Right. The issue is that not only the case of one tuple per group causes
overhead; a few tuples per group (like 2 or 3) is also a case of overhead.
Also, the overhead is related not only to sorting. While investigating the
regression case provided by Heikki [1], I saw the extra time spent mostly in
the extra copying of the sample tuple and the comparison with it. To cope
with this overhead I've introduced MIN_GROUP_SIZE, which lets us avoid
copying sample tuples too frequently.
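
Condensed from the loop in the attached patch (error handling and the
"finished" flag omitted), the batching logic looks roughly like this:

for (;;)
{
	slot = ExecProcNode(outerNode);
	if (TupIsNull(slot))
		break;					/* no more input */

	if (nTuples < MIN_GROUP_SIZE)
	{
		/* absorb blindly, no prefix comparison yet */
		tuplesort_puttupleslot(tuplesortstate, slot);
		if (nTuples == MIN_GROUP_SIZE - 1)
			ExecCopySlot(node->sampleSlot, slot);	/* remember sample */
		nTuples++;
	}
	else if (cmpSortSkipCols(node, node->sampleSlot, slot))
	{
		/* same prefix as the sample: tuple belongs to the current group */
		tuplesort_puttupleslot(tuplesortstate, slot);
		nTuples++;
	}
	else
	{
		/* prefix changed: save the tuple and sort the current group */
		ExecCopySlot(node->sampleSlot, slot);
		break;
	}
}

So below MIN_GROUP_SIZE we pay no comparison cost at all, and above it we
pay one comparison per tuple against the saved sample.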

I see. I wonder if there could ever be a function like
ExecMoveTuple(dst, src). Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership. I haven't really thought that through and
there may be fundamental problems with it...

ExecMoveTuple(dst, src) would be good. But it would be hard to implement
the "move a pointer and pin ownership" principle in our current
infrastructure, because the source and destination can belong to different
memory contexts. AFAICS, we can't just move a memory area between memory
contexts: we have to allocate a new area, memcpy, and then deallocate the
old area.
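
In other words, the best a hypothetical cross-context "move" could do is
allocate-copy-free, along these lines (purely illustrative;
move_minimal_tuple is not an existing API):

/*
 * Hypothetical helper: "move" a minimal tuple into another memory
 * context.  There is no way to re-parent the allocation itself, so
 * this degenerates into allocate + memcpy + free.
 */
static MinimalTuple
move_minimal_tuple(MemoryContext dstcxt, MinimalTuple src)
{
	MinimalTuple dst = (MinimalTuple) MemoryContextAlloc(dstcxt, src->t_len);

	memcpy(dst, src, src->t_len);	/* copy payload into destination context */
	pfree(src);						/* release from the source context */
	return dst;
}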

If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could),

In order to notice that every prefix is different, I would have to compare
every prefix, and that may introduce overhead. The reason I introduced
MIN_GROUP_SIZE is exactly to avoid comparing every prefix...

(2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns),

Yes, I did that before. But again, after introducing MIN_GROUP_SIZE, I no
longer know whether all the prefixes were the same or different. This is
why I have to sort by the full column list for now...

(3) as a more complicated optimisation for intermediate

group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.

That is an interesting idea. The reason we have overhead in comparison
with plain sort is that we do an extra comparison (and copying), but the
result of that comparison is lost to the sort itself. With your scheme,
sorting could "reuse" the prefix comparison, and the overhead would be
lower. But the problem is that we would have to reformat tuples before
putting them into the tuplesort. I wonder if tuple reformatting could eat
the potential performance win...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#46Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Antonin Houska (#44)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

On Fri, Dec 1, 2017 at 11:39 AM, Antonin Houska <ah@cybertec.at> wrote:

I expected the number of groups that actually appear in the output,
whereas you consider it the number of groups started. I can't find a
similar case elsewhere in the code (e.g. the Agg node does not report this
kind of information), so I have no clue. Someone else will have to decide.

OK.

But there is an IncrementalSort node on the remote side.

Let's see what happens. The idea of the "CROSS JOIN, not pushed down" test
is that a cross join with ORDER BY ... LIMIT is not beneficial to push
down, because the LIMIT is not pushed down and the remote side can't use a
top-N heapsort. But if the remote side has incremental sort, then it can be
used, and fetching the first 110 rows is cheap. Let's look at the plan of
the original "CROSS JOIN, not pushed down" test with incremental sort.

# EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

ok, understood, thanks. Perhaps it's worth a comment in the test script.

I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN be pushed down and
does introduce the IncrementalSort in the remote database. This query is
replaced with one that does not allow the join to be pushed down.

*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

Shouldn't the test contain *both* cases?

Thank you for pointing that out. Sure, having both cases is better. I've
added the second case as well as comments. Patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-12.patchapplication/octet-stream; name=incremental-sort-12.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on the remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit
    Output: t1.c1, t2.c1
-   ->  Sort
+   ->  Foreign Scan
          Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
-         ->  Nested Loop
-               Output: t1.c1, t2.c1
-               ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-               ->  Materialize
-                     Output: t2.c1
-                     ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
 
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
   1 | 110
 (10 rows)
 
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and the
+-- remote side can't perform a top-N sort the way the local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Limit
+   Output: t1.c3, t2.c3
+   ->  Sort
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
+         ->  Nested Loop
+               Output: t1.c3, t2.c3
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
+               ->  Materialize
+                     Output: t2.c3
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on the remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and the
+-- remote side can't perform a top-N sort the way the local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 533faf060d..3335fee127 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e4fbafc53..0f993faba4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1936,14 +1949,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			skipCols;
+
+	if (IsA(plan, IncrementalSort))
+		skipCols = ((IncrementalSort *) plan)->skipCols;
+	else
+		skipCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, skipCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->skipCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index f1636a5b88..dd8cffea9c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 558cb08b07..9cb16ca1b6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -274,6 +275,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -482,6 +487,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -917,6 +926,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortReInitializeDSM((SortState *) planstate, pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+			break;
 
 		default:
 			break;
@@ -987,6 +1000,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1231,6 +1247,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9befca9016..7e7e3e666e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort(
+									(IncrementalSort *) node, estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -679,6 +685,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index da6ef1a94c..ae9edb96ab 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -666,6 +666,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  false,
 												  false);
 	}
 
@@ -753,7 +754,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1a1e48fb77
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,649 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is specially optimized kind of multikey sort when
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required list
+ *		of keys.  Thus, when it's required to sort by (key1, key2 ... keyN)
+ *		and the input is already sorted by (key1, key2 ... keyM), M < N, we
+ *		individually sort the groups in which the values of
+ *		(key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while it's required to
+ *		sort them by both x and y.  Let the input tuples be the following.
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm sorts by y, individually, the
+ *		following groups, which have equal x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than full sort on large datasets.  But
+ *		its biggest benefit shows up in queries with LIMIT, because
+ *		incremental sort can return the first tuples without reading the
+ *		whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					skipCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	skipCols = plannode->skipCols;
+
+	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+	for (i = 0; i < skipCols; i++)
+	{
+		Oid equalityOp, equalityFunc;
+		SkipKeyData *key;
+
+		key = &node->skipKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB, result;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].attno;
+		SkipKeyData *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->skipKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem we don't copy the sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  Surely, this might reduce the
+ * efficiency of incremental sort, but it reduces the probability of
+ * regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples whose prefix sort columns are equal and
+ *		sorts them using tuplesort.  This approach avoids sorting the whole
+ *		dataset.  Besides taking less memory and being faster, it allows us
+ *		to start returning tuples before fetching the full dataset from the
+ *		outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * Read the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c.  Subsequent calls fetch tuples from the tuplesort until
+	 * the group is exhausted.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize support structures for cmpSortSkipCols - already
+		 * sorted columns.
+		 */
+		prepareSkipCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE size to the tuplesort, so these groups don't
+		 * necessarily have equal values of the first column.  We're
+		 * unlikely to have huge groups with incremental sort, so using
+		 * abbreviated keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put saved tuple to tuplesort if any */
+	if (!TupIsNull(node->sampleSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+		ExecClearTuple(node->sampleSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put next group of tuples where skipCols sort values are equal to
+	 * tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Put next group of presorted data to the tuplesort */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->sampleSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while skip cols are the same as in saved tuple */
+			bool	cmp;
+			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+			if (cmp)
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->sampleSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->sampleSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->skipKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * tuple table initialization
+	 *
+	 * sort nodes only return scan tuples from their sorted relation.
+	 */
+	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * initialize tuple type.  no need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * We always forget previous sort results and re-read the subplan:
+	 * incremental sort keeps only the current group in the tuplesort, so
+	 * there is no complete sorted output to rewind and rescan.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortReInitializeDSM
+ *
+ *		Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	/* If there's any instrumentation space, clear it for next time */
+	if (node->shared_info != NULL)
+	{
+		memset(node->shared_info->sinfo, 0,
+			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73aa3715e6..ef3587c2f0 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  node->randomAccess);
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index b1515dd8e1..b468158a4c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(skipCols);
 
 	return newnode;
 }
@@ -4816,6 +4850,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b59a5219a7..29dbb7b665 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(skipCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3738,6 +3754,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0d17ae89b0..baf9ba034c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2074,6 +2075,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(skipCols);
 
 	READ_DONE();
 }
@@ -2635,6 +2662,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 47986ba80a..029617219e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 877827dcb5..440bfbfd6e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1604,6 +1605,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the second case we estimate the number of groups into which the source
+ * data is divided by the presorted pathkeys, and then estimate the cost of
+ * sorting each individual group, assuming the data is divided into groups
+ * uniformly.  Also, if LIMIT is specified, then we have to pull from the
+ * source and sort only some of the total groups.
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1630,7 +1638,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is a number of pathkeys already presorted in given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1646,19 +1656,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1684,13 +1703,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the dataset is divided into by the
+	 * presorted keys.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group in which all the
+	 * presorted keys are equal.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1700,7 +1756,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1711,10 +1767,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1722,14 +1778,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate is 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add the per-group cost of fetching tuples from the input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can start returning
+	 * output.  Sorting the rest of the groups is required to return all the
+	 * other tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1740,6 +1815,19 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to account for the cost
+		 * of detecting sort group boundaries.  This amounts to an extra copy
+		 * and comparison for each tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of the per-group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
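
To make the shape of this estimate concrete (the numbers below are purely
hypothetical): with tuples = 1,000,000 divided by the presorted keys into
num_groups = 1,000, each group sorts about 1,000 tuples, so startup_cost is
charged one group's quicksort (roughly 1000 * log2(1000), i.e. ~10,000
comparisons) plus 1/1000th of the input run cost.  If a LIMIT bounds
output_tuples to 1% of tuples, rest_cost = (1000 * 0.01 - 1.0) * group_cost,
so run_cost pays only for the ~9 additional groups that actually have to be
fetched and sorted -- exactly the case where incremental sort beats a full
sort.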
@@ -2708,6 +2796,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2734,6 +2824,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index c6870d314e..b97f22a23c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
+
+
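For example, applied to the pathkey lists (a, b, c) and (a, b, d) this
returns 2, while lists that differ already in their first key yield 0;
pathkeys_contained_in below still answers the stricter containment question.
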
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the given
+ * query_pathkeys.  The remaining keys can be satisfied by incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+	if (enable_incrementalsort)
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/*
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any common prefix of the pathkeys is useful for ordering, because
+		 * incremental sort can supply the remaining keys.
+		 */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		/*
+		 * When incremental sort is disabled, pathkeys are useful only when
+		 * they contain all the query pathkeys.
+		 */
+		if (n_common_pathkeys == list_length(query_pathkeys))
+			return n_common_pathkeys;
+		else
+			return 0;
 	}
-
-	return 0;					/* path ordering not useful */
 }
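
A concrete example of the relaxed behavior: for a query requesting
ORDER BY a, b and a path already sorted by (a), this function now returns 1
when enable_incrementalsort is on, because an incremental sort can supply
the ordering on b within each group of equal a; with the GUC off it keeps
the old all-or-nothing answer of 0.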
 
 /*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0477..7833c4512b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int skipCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int skipCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+	if (n_common_pathkeys < list_length(pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4916,8 +4944,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			skip_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		skip_cols = ((IncrementalSort *) plan)->skipCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, skip_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5508,13 +5541,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use a regular Sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		skipCols = 0;
+
+	if (skipCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->skipCols = skipCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5847,9 +5898,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'skipCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5869,7 +5922,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5912,7 +5965,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5933,7 +5986,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5966,7 +6020,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6623,6 +6677,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 889e8af33b..49af1f1912 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e8bc15c35d..726ddd3025 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
 			foreach(lc, input_rel->partial_pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
-				bool		is_sorted;
+				int			n_useful_pathkeys;
 
-				is_sorted = pathkeys_contained_in(root->group_pathkeys,
-												  path->pathkeys);
-				if (path == cheapest_partial_path || is_sorted)
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
 				{
 					/* Sort the cheapest partial path, if it isn't already */
-					if (!is_sorted)
+					if (n_useful_pathkeys < list_length(root->group_pathkeys))
 						path = (Path *) create_sort_path(root,
 														 grouped_rel,
 														 path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index b5c41241d7..1ff9d42ab4 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 2e3abeea3d..0ee6812e80 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a24e8acfa6..f79523d697 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -989,7 +989,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 54126fbb6a..3b65ccca87 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+	if (n_common_pathkeys == list_length(pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2601,9 +2610,31 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->skipCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2617,7 +2648,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2929,7 +2962,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 1e323d9444..8f01f05ae5 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
 												   work_mem,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ea95b8068d..abf6c3853a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by each prefix of pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first (i + 1) pathkeys divide the dataset into.  This is just a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
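
A hedged sketch of how a caller could consume the returned array (the loop
is purely illustrative and does not appear in the patch; "path" and "tuples"
are assumed to be in scope):

    double *counts = estimate_pathkeys_groups(path->pathkeys, root, tuples);
    int     nkeys = list_length(path->pathkeys);
    int     i;

    /* counts[i] is the number of groups formed by the first i + 1 pathkeys */
    for (i = 0; i < nkeys; i++)
        elog(DEBUG1, "first %d key(s): %.0f groups", i + 1, counts[i]);
    pfree(counts);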
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0f7a96d85c..9e4ec22366 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
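
For anyone trying the patch out: the planner's use of the new node can be
toggled per session with SET enable_incrementalsort = off, which is exactly
what the regression test changes further down rely on.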
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c23ac75a0..118edb98a4 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among the
+								   sort groups, either in memory or on disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+								   space, false when it is the value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -607,18 +617,29 @@ static Tuplesortstate *
 tuplesort_begin_common(int workMem, bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data that is useful to keep across multiple similar sort batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -636,7 +657,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -654,6 +675,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -694,13 +716,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -742,7 +765,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -773,7 +796,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -864,7 +887,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -939,7 +962,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -981,7 +1004,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1092,16 +1115,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing the tuplesort's resources.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1179,98 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	if (spaceUsed > state->maxSpace)
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	to save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
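
For context, a rough sketch of the per-group call pattern this API enables;
the control flow and the helpers fetch_next_slot(), begins_new_group() and
spool_out_sorted() are invented for illustration and are not code from the
patch:

    while (fetch_next_slot(node, slot))        /* hypothetical input loop */
    {
        if (begins_new_group(node, slot))      /* skip keys changed? */
        {
            tuplesort_performsort(state);      /* finish the current group */
            spool_out_sorted(node, state);     /* return its tuples */
            tuplesort_reset(state);            /* keep metadata, drop tuples */
        }
        tuplesort_puttupleslot(state, slot);
    }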
 
 /*
@@ -2949,18 +3059,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..b2e4e5061f
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c9ad..fba6082f95 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1753,6 +1753,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "skip keys".
+ *	 SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} SkipKeyData;
+
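A hedged sketch of how one skip key might be used to compare the current
tuple against the group's sample tuple, assuming the comparison function is
a btree-style cmp proc returning int32 (null handling is omitted, and "key",
"slotA", "slotB" and start_new_group() are invented for the example):

    bool    isnullA, isnullB;
    Datum   a = slot_getattr(slotA, key->attno, &isnullA);
    Datum   b = slot_getattr(slotB, key->attno, &isnullB);

    if (DatumGetInt32(FunctionCall2(&key->flinfo, a, b)) != 0)
        start_new_group();     /* hypothetical: tuple opens a new group */
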
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1781,6 +1795,44 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* have we finished fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal skip keys */
+	TupleTableSlot *sampleSlot;	/* slot for a sample tuple of the current
+								   sort group */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index c5b5115f5b..9ae5d57449 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366680..d6d15396a2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			skipCols;		/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 1108b6a0ea..2baccda6ff 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1512,6 +1512,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			skipCols;
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 5a1fbf97c3..5e4acebe41 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
+extern bool enable_incrementalsort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
@@ -104,8 +105,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ea886b6501..b4370e2621 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
 extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 							  List *mergeclauses,
 							  List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 199a6317f5..41b7196adf 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index b6b8c8ef8c..938d329e15 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 2b738aae7c..896fdfb585 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge         | on
  enable_hashagg             | on
  enable_hashjoin            | on
+ enable_incrementalsort     | on
  enable_indexonlyscan       | on
  enable_indexscan           | on
  enable_material            | on
@@ -86,7 +87,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan             | on
  enable_sort                | on
  enable_tidscan             | on
-(14 rows)
+(15 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#47Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#46)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

Thank you for pointing that out. Sure, both cases are better. I've added
the second case as well as comments. The patch is attached.

I just found that the patch fails to apply according to commitfest.cputube.org.
Please find the rebased patch attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-13.patchapplication/octet-stream; name=incremental-sort-13.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on the remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit
    Output: t1.c1, t2.c1
-   ->  Sort
+   ->  Foreign Scan
          Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
-         ->  Nested Loop
-               Output: t1.c1, t2.c1
-               ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-               ->  Materialize
-                     Output: t2.c1
-                     ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
 
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
   1 | 110
 (10 rows)
 
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and the
+-- remote side can't perform a top-N sort the way the local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Limit
+   Output: t1.c3, t2.c3
+   ->  Sort
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
+         ->  Nested Loop
+               Output: t1.c3, t2.c3
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
+               ->  Materialize
+                     Output: t2.c3
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on the remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and the
+-- remote side can't perform a top-N sort the way the local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..f80d396cfc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1936,14 +1949,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			skipCols;
+
+	if (IsA(plan, IncrementalSort))
+		skipCols = ((IncrementalSort *) plan)->skipCols;
+	else
+		skipCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, skipCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->skipCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 		case T_SortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+			break;
 
 		default:
 			break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..bc92c3d0e7 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort(
+									(IncrementalSort *) node, estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  false,
 												  false);
 	}
 
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1a1e48fb77
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,649 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required key
+ *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we sort
+ *		each group of tuples where the values of (key1, key2 ... keyM) are
+ *		equal individually.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y the
+ *		following groups, which have equal x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		the biggest benefit shows up in queries with LIMIT, because
+ *		incremental sort can return the first tuples without reading the
+ *		whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
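To make the description above concrete, here is a minimal standalone C sketch
of the same grouping idea; it is not part of the patch, and the Pair type and
all names are invented. It sorts the example (x, y) tuples group by group with
qsort, assuming the input is already ordered by x:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			x;
	int			y;
} Pair;

static int
cmp_y(const void *a, const void *b)
{
	const Pair *pa = (const Pair *) a;
	const Pair *pb = (const Pair *) b;

	return (pa->y > pb->y) - (pa->y < pb->y);
}

int
main(void)
{
	/* input already presorted by x, as in the example above */
	Pair		tuples[] = {
		{1, 5}, {1, 2}, {2, 10}, {2, 1}, {2, 5}, {3, 3}, {3, 7}
	};
	int			ntuples = sizeof(tuples) / sizeof(tuples[0]);
	int			start = 0;
	int			i;

	while (start < ntuples)
	{
		int			end = start;

		/* find the end of the current group of equal x */
		while (end < ntuples && tuples[end].x == tuples[start].x)
			end++;

		/* sort just this group by y; the whole result comes out (x, y)-sorted */
		qsort(&tuples[start], end - start, sizeof(Pair), cmp_y);
		start = end;
	}

	for (i = 0; i < ntuples; i++)
		printf("(%d, %d)\n", tuples[i].x, tuples[i].y);
	return 0;
}

Running it prints the fully sorted sequence from the comment above; the
executor code below does the same thing per group, but with tuplesort and
arbitrary sort keys.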
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					skipCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	skipCols = plannode->skipCols;
+
+	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+	for (i = 0; i < skipCols; i++)
+	{
+		Oid equalityOp, equalityFunc;
+		SkipKeyData *key;
+
+		key = &node->skipKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB, result;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].attno;
+		SkipKeyData *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->skipKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples into node->sampleSlot introduces some overhead.  It's
+ * especially noticeable when groups contain just one or a few tuples.  To
+ * cope with this problem we don't copy a sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the
+ * efficiency of incremental sort, but it reduces the probability of
+ * regression.
+ */
+#define MIN_GROUP_SIZE 32
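As a rough sketch of that policy (invented names, not patch code): the first
MIN_GROUP_SIZE tuples are absorbed without any comparison, only the last of
them is copied as the sample, and later tuples stay in the current batch only
while their presorted prefix matches the sample:

#include <stdbool.h>

#define MIN_GROUP_SIZE 32

typedef struct
{
	int			x;			/* presorted prefix column */
	int			y;
} Pair;

static bool
absorb_into_batch(Pair *sample, Pair tup, int ntuples)
{
	if (ntuples < MIN_GROUP_SIZE)
	{
		if (ntuples == MIN_GROUP_SIZE - 1)
			*sample = tup;		/* the single copy paid per batch */
		return true;			/* always absorb the minimal batch */
	}
	return tup.x == sample->x;	/* batch ends when the prefix changes */
}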
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it lets us start returning tuples before fetching the full
+ *		dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * If first time through, read all tuples from outer plan and pass them to
+	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize support structures for cmpSortSkipCols - already
+		 * sorted columns.
+		 */
+		prepareSkipCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+		 * necessarily have equal values of the first column.  We are
+		 * unlikely to see huge groups with incremental sort, so using
+		 * abbreviated keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put the saved tuple, if any, into the tuplesort */
+	if (!TupIsNull(node->sampleSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+		ExecClearTuple(node->sampleSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, i.e. tuples whose skipCols sort values
+	 * are all equal, into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Put next group of presorted data to the tuplesort */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->sampleSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while skip cols are the same as in saved tuple */
+			bool	cmp;
+			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+			if (cmp)
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->sampleSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->sampleSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->skipKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * tuple table initialization
+	 *
+	 * sort nodes only return scan tuples from their sorted relation.
+	 */
+	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * initialize tuple type.  no need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort holds only the current group in the tuplesort, so
+	 * we cannot simply rewind and rescan the sorted output.  Forget the
+	 * previous sort results; we have to re-read the subplan and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
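The size computation here (and in the DSM initializer below) is the usual
flexible-array-member pattern: a fixed header plus one instrumentation slot
per worker. A self-contained sketch with invented struct names, not the
patch's types:

#include <stddef.h>

typedef struct
{
	long		space_used;		/* stand-in for the real payload */
} WorkerSortStats;

typedef struct
{
	int			num_workers;
	WorkerSortStats sinfo[];	/* flexible array, one slot per worker */
} SharedStatsSketch;

/* header plus one payload slot per worker, as estimated above */
static size_t
shared_stats_size(int nworkers)
{
	return offsetof(SharedStatsSketch, sinfo) +
		(size_t) nworkers * sizeof(WorkerSortStats);
}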
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortReInitializeDSM
+ *
+ *		Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	/* If there's any instrumentation space, clear it for next time */
+	if (node->shared_info != NULL)
+	{
+		memset(node->shared_info->sinfo, 0,
+			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  node->randomAccess);
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(skipCols);
 
 	return newnode;
 }
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(skipCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..99d6938ddc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2074,6 +2075,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(skipCols);
 
 	READ_DONE();
 }
@@ -2636,6 +2663,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..05f58fff79 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the presorted pathkeys
+ * divide the source data into, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided into groups uniformly.
+ * Also, if a LIMIT is specified then we only have to pull from the source
+ * and sort some of the groups.
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add per group cost of fetching tuples from input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can start returning
+	 * tuples.  Sorting the rest of the groups is required to return all the
+	 * other tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,19 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to account for the cost
+		 * of detecting the sort groups.  This amounts to one extra copy and
+		 * comparison per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
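As a quick numeric sanity check of the quicksort branch above, a standalone
sketch with made-up constants (it assumes no LIMIT, so output_tuples equals
tuples and the rest_cost factor reduces to num_groups - 1; input and
per-tuple charges are ignored):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		tuples = 1e6;
	double		num_groups = 1000.0;	/* as from estimate_num_groups() */
	double		comparison_cost = 2.0 * 0.0025; /* e.g. 2 * cpu_operator_cost */
	double		group_tuples = tuples / num_groups;
	double		group_cost = comparison_cost * group_tuples * log2(group_tuples);
	double		startup = group_cost;	/* sort the first group only */
	double		run = (num_groups - 1.0) * group_cost;	/* remaining groups */

	printf("startup = %.2f, total = %.2f\n", startup, startup + run);
	return 0;
}

With these numbers the startup cost comes out roughly num_groups times smaller
than the total cost, which is exactly why incremental sort wins for LIMIT
queries.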
@@ -2717,6 +2805,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2743,6 +2833,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
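Pointer comparison is sufficient here because pathkeys are canonical, so the
function is just a longest-common-prefix walk. The same idea on plain integer
arrays, as an illustrative sketch only:

/* length of the longest common prefix of two integer arrays */
static int
common_prefix(const int *a, int alen, const int *b, int blen)
{
	int			n = 0;

	while (n < alen && n < blen && a[n] == b[n])
		n++;
	return n;
}

For example, with query pathkeys (a, b, c) and a path ordered by (a, b, d)
the result is 2, and an incremental sort only has to sort within groups of
equal (a, b).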
+
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the given argument.
+ * The others can be satisfied by an incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+	if (enable_incrementalsort)
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/*
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any leading common pathkeys are useful for ordering because we
+		 * can use an incremental sort for the rest.
+		 */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		/*
+		 * When incremental sort is disabled, pathkeys are useful only when
+		 * they contain all the query pathkeys.
+		 */
+		if (n_common_pathkeys == list_length(query_pathkeys))
+			return n_common_pathkeys;
+		else
+			return 0;
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int skipCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int skipCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+	if (n_common_pathkeys < list_length(pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			skip_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		skip_cols = ((IncrementalSort *) plan)->skipCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, skip_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		skipCols = 0;
+
+	if (skipCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->skipCols = skipCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'skipCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
 			foreach(lc, input_rel->partial_pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
-				bool		is_sorted;
+				int			n_useful_pathkeys;
 
-				is_sorted = pathkeys_contained_in(root->group_pathkeys,
-												  path->pathkeys);
-				if (path == cheapest_partial_path || is_sorted)
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
 				{
 					/* Sort the cheapest partial path, if it isn't already */
-					if (!is_sorted)
+					if (n_useful_pathkeys < list_length(root->group_pathkeys))
 						path = (Path *) create_sort_path(root,
 														 grouped_rel,
 														 path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0, 
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+	if (n_common_pathkeys == list_length(pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->skipCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
 												   work_mem,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+ * 							  is divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  This is essentially a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..80bc67c093 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum space occupied across sorts of
+								   groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is value for on-disk
+								   space, fase when it's value for in-memory
+								   space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
 tuplesort_begin_common(int workMem, bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax 
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	if (spaceUsed > state->maxSpace)
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows avoiding recreation of the tuplesort (and thus
+ *	saving resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..b2e4e5061f
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "skip keys".
+ *	 SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} SkipKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from the outer node
+								   is finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal skip keys */
+	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			skipCols;		/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			skipCols;
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
+extern bool enable_incrementalsort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
 extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 							  List *mergeclauses,
 							  List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge         | on
  enable_hashagg             | on
  enable_hashjoin            | on
+ enable_incrementalsort     | on
  enable_indexonlyscan       | on
  enable_indexscan           | on
  enable_material            | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan             | on
  enable_sort                | on
  enable_tidscan             | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#48Tels
nospam-pg-abuse@bloodgate.com
In reply to: Alexander Korotkov (#47)
Re: [HACKERS] [PATCH] Incremental sort

Hello Alexander,

On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:

On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

Thank you for pointing that. Sure, both cases are better. I've added
second case as well as comments. Patch is attached.

I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:

There are quite a few places where lines are broken like so:

+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
Or like this:
+			result = (PlanState *) ExecInitIncrementalSort(
+									(IncrementalSort *) node, estate, eflags);

e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?

I'd fix the grammar here:

+ *		Incremental sort is specially optimized kind of multikey sort when
+ *		input is already presorted by prefix of required keys list.

Like so:

"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."

+ * Consider following example. We have input tuples consisting from

"Consider the following example: We have ..."

+ * In incremental sort case we also have to cost to detect sort groups.

"we also have to cost the detection of sort groups."

"+ * It turns out into extra copy and comparison for each tuple."

"This turns out to be one extra copy and comparison per tuple."

+ "Portions Copyright (c) 1996-2017"

Should probably be 2018 now - time flies fast :)

 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 7))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();

I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15)), shouldn't it?
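
(As a side note, here's a minimal standalone sketch of the MATCH logic --
modeled on the macro in readfuncs.c, with "token" and "length" standing in
for the read-function state there -- showing why a stale length can never
match the full token name:

#include <stdio.h>
#include <string.h>

#define MATCH(tokname, namelen) \
        (length == (namelen) && memcmp(token, (tokname), (namelen)) == 0)

int
main(void)
{
        const char *token = "INCREMENTALSORT";
        int         length = (int) strlen(token);      /* 15 */

        /* The length check fails first, so memcmp is never reached. */
        printf("%d\n", MATCH("INCREMENTALSORT", 7));    /* prints 0 */
        printf("%d\n", MATCH("INCREMENTALSORT", 15));   /* prints 1 */
        return 0;
}
)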

+ space, fase when it's value for in-memory

typo: "space, false when ..."

+			bool	cmp;
+			cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+			if (cmp)

In the above, the variable cmp could be optimized away with:

+ if (cmpSortSkipCols(node, node->sampleSlot, slot))

(not sure if modern compilers won't do this, anyway, though)

+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */

If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.
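
For illustration, a quick standalone sketch (the struct and field names are
made up for the example; it assumes the usual 8-byte alignment of int64 on
64-bit platforms):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct mixed                    /* bool and int64 interleaved */
{
        bool    a;              /* 1 byte + 7 bytes of padding */
        int64_t b;
        bool    c;              /* 1 byte + 7 bytes of padding */
        int64_t d;
};                              /* typically 32 bytes */

struct grouped                  /* bools packed together */
{
        int64_t b;
        int64_t d;
        bool    a;
        bool    c;              /* 2 bytes + 6 bytes tail padding */
};                              /* typically 24 bytes */

int
main(void)
{
        printf("mixed: %zu, grouped: %zu\n",
               sizeof(struct mixed), sizeof(struct grouped));
        return 0;
}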

That's all for now :)

Thank you for your work,

Tels

#49Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tels (#48)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

On Fri, Jan 5, 2018 at 2:21 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:

On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

Thank you for pointing that. Sure, both cases are better. I've added
second case as well as comments. Patch is attached.

I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:

There are quite a few places where lines are broken like so:

+                       ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+                                                           pwcxt);

It's quite common practice to align the second argument to the same
position as the first one. See other lines nearby.

Or like this:

+                       result = (PlanState *) ExecInitIncrementalSort(
+                                               (IncrementalSort *) node, estate, eflags);

It was probably not such a good idea to insert a line break before the
first argument. Fixed.

e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?

I'd fix the grammar here:

+ *             Incremental sort is specially optimized kind of multikey sort when
+ *             input is already presorted by prefix of required keys list.

Like so:

"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."

+ * Consider following example. We have input tuples consisting from

"Consider the following example: We have ..."

+ * In incremental sort case we also have to cost to detect sort groups.

"we also have to cost the detection of sort groups."

"+ * It turns out into extra copy and comparison for each tuple."

"This turns out to be one extra copy and comparison per tuple."

Many thanks for noticing these. Fixed.

+ "Portions Copyright (c) 1996-2017"

Should probably be 2018 now - time flies fast :)

Right. Happy New Year! :)

return_value = _readMaterial();
else if (MATCH("SORT", 4))
return_value = _readSort();
+       else if (MATCH("INCREMENTALSORT", 7))
+               return_value = _readIncrementalSort();
else if (MATCH("GROUP", 5))
return_value = _readGroup();

I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15)), shouldn't it?

Good catch, thank you!

+ space, fase when it's value for in-memory

typo: "space, false when ..."

Right. Fixed.

+                       bool    cmp;
+                       cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+                       if (cmp)

In the above, the variable cmp could be optimized away with:

+ if (cmpSortSkipCols(node, node->sampleSlot, slot))

Right. This comes from a time when there was more complicated code which
had to use the cmp variable multiple times.

(not sure if modern compilers won't do this, anyway, though)

Anyway, it's a code simplification which is good regardless of whether
compilers are able to do it themselves or not.

+typedef struct IncrementalSortState
+{
+       ScanState       ss;                             /* its first field is NodeTag */
+       bool            bounded;                /* is the result set bounded? */
+       int64           bound;                  /* if bounded, how many tuples are needed */

If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.

I'd like to keep the common members of SortState and
IncrementalSortState ordered the same way.
Thus, I think that if we're going to reorder, then we should do this in
both data structures.
But I'm not sure it's worth considering, because these data structures are
very unlikely to be the source of significant memory consumption...

That's all for now :)

Great, thank you for review.

BTW, I also fixed documentation markup (regarding migration to xml).

Rebased patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-14.patchapplication/octet-stream; name=incremental-sort-14.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..1814f98b8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,27 +1979,18 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit
    Output: t1.c1, t2.c1
-   ->  Sort
+   ->  Foreign Scan
          Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
-         ->  Nested Loop
-               Output: t1.c1, t2.c1
-               ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-               ->  Materialize
-                     Output: t2.c1
-                     ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(15 rows)
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
 
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
@@ -2016,6 +2007,44 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 1
   1 | 110
 (10 rows)
 
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Limit
+   Output: t1.c3, t2.c3
+   ->  Sort
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
+         ->  Nested Loop
+               Output: t1.c3, t2.c3
+               ->  Foreign Scan on public.ft1 t1
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
+               ->  Materialize
+                     Output: t2.c3
+                     ->  Foreign Scan on public.ft2 t2
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
+(15 rows)
+
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..bbf697d64b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,10 +508,15 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, pushed down, thanks to incremental sort on remote side
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
+-- can't perform top-N sort like local side can.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
 -- different server, not pushed down. No result expected.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..fdcdc6683f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1936,14 +1949,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			skipCols;
+
+	if (IsA(plan, IncrementalSort))
+		skipCols = ((IncrementalSort *) plan)->skipCols;
+	else
+		skipCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, skipCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->skipCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
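+
+	/*
+	 * For an incremental sort, additionally list the presorted prefix of
+	 * the sort keys; e.g. EXPLAIN prints "Presorted Key: t1.a" after
+	 * "Sort Key: t1.a, t1.b".
+	 */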
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: " INT64_FORMAT "\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: " INT64_FORMAT "\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
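+			/* holds only the current sort group, so can't back up */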
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 		case T_SortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+			break;
 
 		default:
 			break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  false,
 												  false);
 	}
 
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..a8e55e5e2d
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,646 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required keys
+ *		list.  Thus, when we must sort by (key1, key2 ... keyN) and the input
+ *		is already sorted by (key1, key2 ... keyM), M < N, we separately sort
+ *		groups where the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, and we must sort them
+ *		by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y the
+ *		following groups, which have equal x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we would get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		it brings the biggest benefit to queries with LIMIT, because it can
+ *		return the first tuples without reading the whole input dataset.
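+ *
+ *		A typical use case is a query such as
+ *
+ *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+ *
+ *		where an index on x supplies input presorted by x, so this node
+ *		only needs to sort successive x-groups by y until the limit is
+ *		satisfied.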
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					skipCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	skipCols = plannode->skipCols;
+
+	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+	for (i = 0; i < skipCols; i++)
+	{
+		Oid equalityOp, equalityFunc;
+		SkipKeyData *key;
+
+		key = &node->skipKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether the first "skipCols" sort key values of two tuples are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB, result;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].attno;
+		SkipKeyData *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->skipKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead, which is
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem, we don't copy a sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of regression.
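+ * For example, with MIN_GROUP_SIZE = 32, a run of single-tuple prefix
+ * groups is accumulated into one batch of 32 tuples and sorted by all
+ * columns at once, instead of the tuplesort being reset 32 times.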
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, perform an incremental sort.  It
+ *		fetches groups of tuples whose prefix sort columns are equal and
+ *		sorts them using tuplesort.  This approach avoids sorting the whole
+ *		dataset at once.  Besides taking less memory and being faster, it
+ *		allows us to start returning tuples before fetching the full
+ *		dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * If this is the first call, or the previous group has been fully
+	 * returned, read the next group of tuples from the outer plan and pass
+	 * them to tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize support structures for cmpSortSkipCols - already
+		 * sorted columns.
+		 */
+		prepareSkipCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to the tuplesort, so those groups don't
+		 * necessarily have equal values of the first column.  Groups are
+		 * unlikely to be huge with incremental sort, so using abbreviated
+		 * keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
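+	/* e.g. with bound = 10 and bound_Done = 6 this group needs a top-4 sort */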
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put the saved tuple, if any, into the tuplesort */
+	if (!TupIsNull(node->sampleSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+		ExecClearTuple(node->sampleSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put into the tuplesort the next group of tuples, i.e. those whose
+	 * skipCols sort values are all equal.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Unconditionally put the first MIN_GROUP_SIZE tuples into the group */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->sampleSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while skip cols are the same as in saved tuple */
+			if (cmpSortSkipCols(node, node->sampleSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->sampleSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done by the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the incremental sort
+ *		node produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->sampleSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->skipKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * tuple table initialization
+	 *
+	 * sort nodes only return scan tuples from their sorted relation.
+	 */
+	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * initialize tuple type.  no need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't retain the sorted output, since only the
+	 * current group is kept in the tuplesortstate.  So we always forget
+	 * previous sort results, re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate incremental sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for incremental sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortReInitializeDSM
+ *
+ *		Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	/* If there's any instrumentation space, clear it for next time */
+	if (node->shared_info != NULL)
+	{
+		memset(node->shared_info->sinfo, 0,
+			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for incremental sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer incremental sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  node->randomAccess);
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(skipCols);
 
 	return newnode;
 }
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(skipCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..9f64d50103 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2074,6 +2075,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(skipCols);
 
 	READ_DONE();
 }
@@ -2636,6 +2663,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..fd0ba203d5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when the data is already presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the presorted pathkeys
+ * divide the input into, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided among groups uniformly.
+ * Also, if a LIMIT is specified, then we only have to fetch and sort some
+ * of the groups rather than all of them.
+ *
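+ * For example, with 1,000,000 input tuples divided into 1,000 presorted
+ * groups, each group sorts about 1,000 tuples, so the total comparison
+ * cost is about num_groups * comparison_cost * 1000 * log2(1000), roughly
+ * half of comparison_cost * 1000000 * log2(1000000) for one full sort.
+ *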
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of leading pathkeys by which the input is
+ * already sorted
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the dataset is divided into by the
+	 * presorted keys.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group whose presorted keys
+	 * are all equal.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add the per-group cost of fetching tuples from the input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can output anything;
+	 * sorting the remaining groups is required to return all the other
+	 * tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to cost the detection of
+		 * sort groups.  That turns out to be one extra copy and one extra
+		 * comparison per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2717,6 +2806,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2743,6 +2834,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
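+ *    For example, if keys1 is (a, b, c) and keys2 is (a, b, d), the result
+ *    is 2.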
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
+
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the requested ordering.
+ * The remaining keys can be satisfied by an incremental sort.
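+ *
+ * For example, if query_pathkeys is (a, b, c) and pathkeys is (a, b), the
+ * result is 2 when incremental sort is enabled, and 0 when it is disabled.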
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+	if (enable_incrementalsort)
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/*
+		 * Return the number of path keys in common, or 0 if there are none. Any
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any leading common pathkeys are useful for ordering because we can
+		 * use incremental sort for the remaining ones.
+		return n_common_pathkeys;
+	}
+	else
+	{
+		/*
+		 * When incremental sort is disabled, pathkeys are useful only if
+		 * they contain all the query pathkeys.
+		 */
+		if (n_common_pathkeys == list_length(query_pathkeys))
+			return n_common_pathkeys;
+		else
+			return 0;
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int skipCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int skipCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+	if (n_common_pathkeys < list_length(pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			skip_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		skip_cols = ((IncrementalSort *) plan)->skipCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, skip_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		skipCols = 0;
+
+	if (skipCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->skipCols = skipCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'skipCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
 			foreach(lc, input_rel->partial_pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
-				bool		is_sorted;
+				int			n_useful_pathkeys;
 
-				is_sorted = pathkeys_contained_in(root->group_pathkeys,
-												  path->pathkeys);
-				if (path == cheapest_partial_path || is_sorted)
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
 				{
 					/* Sort the cheapest partial path, if it isn't already */
-					if (!is_sorted)
+					if (n_useful_pathkeys < list_length(root->group_pathkeys))
 						path = (Path *) create_sort_path(root,
 														 grouped_rel,
 														 path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+	if (n_common_pathkeys == list_length(pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->skipCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
 												   work_mem,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..0265da312b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's a value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
 tuplesort_begin_common(int workMem, bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	if (spaceUsed > state->maxSpace)
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows us to avoid recreating the tuplesort (and thus
+ *	save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "skip keys".
+ *	 SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} SkipKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal skip keys */
+	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			skipCols;		/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			skipCols;
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
+extern bool enable_incrementalsort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
 extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 							  List *mergeclauses,
 							  List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge         | on
  enable_hashagg             | on
  enable_hashjoin            | on
+ enable_incrementalsort     | on
  enable_indexonlyscan       | on
  enable_indexscan           | on
  enable_material            | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan             | on
  enable_sort                | on
  enable_tidscan             | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#50Antonin Houska
ah@cybertec.at
In reply to: Alexander Korotkov (#46)
Re: [HACKERS] [PATCH] Incremental sort

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Antonin Houska <ah@cybertec.at> wrote:

Shouldn't the test contain *both* cases?

Thank you for pointing that. Sure, both cases are better. I've added second case as well as comments. Patch is attached.

I'm fine with the tests now but have a minor comment on this comment:

-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.

I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both

SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

and

SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;

evaluate the LIMIT clause only locally.

What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.
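
To illustrate the point with a self-contained sketch (hypothetical table and
index names; this is not part of the patch or its tests): when only the
leading ORDER BY column is covered by an index, an incremental sort has to
sort just the remaining column within each group of equal leading values, so
its startup cost stays low.

CREATE TABLE example (a int, b int);
CREATE INDEX example_a_idx ON example (a);
INSERT INTO example SELECT i / 10, i % 10 FROM generate_series(1, 10000) i;
ANALYZE example;
SET enable_seqscan = off;        -- force the index path, as the regression tests do
SET enable_incrementalsort = on; -- GUC added by this patch
EXPLAIN (COSTS OFF) SELECT * FROM example ORDER BY a, b OFFSET 100 LIMIT 10;

With the patch applied one would expect an Incremental Sort node with
"Presorted Key: a" above the index scan rather than a full Sort; that cheap
ordering is exactly what the pushed-down variant of the query can exploit on
the remote side.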

I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#51Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Antonin Houska (#50)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Mon, Jan 8, 2018 at 2:29 PM, Antonin Houska <ah@cybertec.at> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Antonin Houska <ah@cybertec.at> wrote:

Shouldn't the test contain *both* cases?

Thank you for pointing that. Sure, both cases are better. I've added second case as well as comments. Patch is attached.

I'm fine with the tests now but have a minor comment on this comment:

-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.

I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both

SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

and

SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;

evaluate the LIMIT clause only locally.

What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while
incremental sort tends to be very cheap). Therefore it's cheaper to do no
remote sort at all and perform the top-N sort locally than to do a regular
(non-incremental) remote sort.

Agreed, these comments were not clear enough. I've rewritten them: they are
wordier now, but they seem clearer to me. I've also swapped the order of the
queries, which I find easier to understand.

I have no other questions about this patch. I expect the CFM to set the
status to "ready for committer" as soon as the other reviewers confirm
they're happy about the patch status.

Good, thank you. Let's see what other reviewers will say.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-15.patchapplication/octet-stream; name=incremental-sort-15.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641fa7..80239faf21 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1979,28 +1979,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort, but that can't be performed on the remote side because we
+-- never push LIMIT down.  Given that the sort alone isn't worth pushing down,
+-- the CROSS JOIN isn't pushed down either, so fewer tuples cross the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the required order using an index scan plus an incremental
+-- sort rather than a full sort.  This is much cheaper than a full sort on the
+-- local side, even though the remote side doesn't know the LIMIT.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c705f..c324394942 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -508,7 +508,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort, but that can't be performed on the remote side because we
+-- never push LIMIT down.  Given that the sort alone isn't worth pushing down,
+-- the CROSS JOIN isn't pushed down either, so fewer tuples cross the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the required order using an index scan plus an incremental
+-- sort rather than a full sort.  This is much cheaper than a full sort on the
+-- local side, even though the remote side doesn't know the LIMIT.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..fdcdc6683f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3553,6 +3553,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79e6985d0d..6cf5f8bad1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1011,6 +1015,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1611,6 +1618,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1936,14 +1949,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			skipCols;
+
+	if (IsA(plan, IncrementalSort))
+		skipCols = ((IncrementalSort *) plan)->skipCols;
+	else
+		skipCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, skipCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->skipCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1954,7 +1990,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1978,7 +2014,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2047,7 +2083,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2104,7 +2140,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2117,13 +2153,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2163,9 +2200,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2373,6 +2414,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups: %ld",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
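+
+/*
+ * With EXPLAIN ANALYZE, the text-format block above prints, for example
+ * (illustrative only):
+ *
+ *		Sort Method: quicksort  Memory: 25kB
+ *		Sort Groups: 11
+ */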
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
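+			/* unsupported: only the current sort group is kept in tuplesort */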
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8b72ebab9..490d6dd76c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 		case T_SortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
+		case T_IncrementalSortState:
+			/* instrumentation in DSM must be cleared before rescanning */
+			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+			break;
 
 		default:
 			break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 46ee880415..30855c3fe7 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -667,6 +667,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  false,
 												  false);
 	}
 
@@ -754,7 +755,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..a8e55e5e2d
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,646 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required keys
+ *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we sort each
+ *		group of tuples where the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Suppose the input tuples are the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y the
+ *		following groups, which have equal values of x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		its biggest benefit shows up in queries with LIMIT, because
+ *		incremental sort can return the first tuples without reading the
+ *		whole input dataset.
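+ *
+ *		As an illustration (a hypothetical table tbl, assuming an index
+ *		on x):
+ *
+ *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+ *
+ *		An index scan on x returns tuples already sorted by x, so
+ *		incremental sort need only sort a few leading x-groups by y before
+ *		the LIMIT is satisfied.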
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare the equality-function call info used to compare the skip
+ * (presorted) key columns.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					skipCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	skipCols = plannode->skipCols;
+
+	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+	for (i = 0; i < skipCols; i++)
+	{
+		Oid equalityOp, equalityFunc;
+		SkipKeyData *key;
+
+		key = &node->skipKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB, result;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].attno;
+		SkipKeyData *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->skipKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->sampleSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem we don't copy the sample tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This may reduce the efficiency
+ * of incremental sort, but it reduces the probability of a regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming the outer subtree returns tuples presorted by some prefix
+ *		of the target sort columns, perform an incremental sort.  We fetch
+ *		groups of tuples whose prefix sort columns are equal and sort them
+ *		using tuplesort.  This approach avoids sorting the whole dataset at
+ *		once.  Besides taking less memory and being faster, it lets us start
+ *		returning tuples before the full dataset has been fetched from the
+ *		outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the current sorted group, if any.  If the
+	 * group is exhausted but the input isn't finished, fall through to
+	 * sort the next group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * Read the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for cmpSortSkipCols, which
+		 * compares the already-sorted (skip) columns.
+		 */
+		prepareSkipCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to tuplesort, so the groups don't
+		 * necessarily have equal values of the first column.  Groups are
+		 * unlikely to be huge with incremental sort, so using abbreviated
+		 * keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Pass the saved tuple to tuplesort, if any */
+	if (!TupIsNull(node->sampleSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+		ExecClearTuple(node->sampleSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Pass to tuplesort the next group of tuples, i.e. tuples whose
+	 * skipCols sort values are all equal.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* First accumulate a minimal group of MIN_GROUP_SIZE tuples */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->sampleSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while skip cols are the same as in saved tuple */
+			if (cmpSortSkipCols(node, node->sampleSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
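+				/* start of the next group; stash the tuple for the next call */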
+				ExecCopySlot(node->sampleSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->sampleSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->skipKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * tuple table initialization
+	 *
+	 * sort nodes only return scan tuples from their sorted relation.
+	 */
+	ExecInitResultTupleSlot(estate, &incrsortstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &incrsortstate->ss);
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * initialize tuple type.  no need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecAssignResultTypeFromTL(&incrsortstate->ss.ps);
+	ExecAssignScanTypeFromOuterPlan(&incrsortstate->ss);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort keeps only the current group in its tuplesort state,
+	 * so we cannot simply rewind and rescan the sorted output.  Forget the
+	 * previous sort results; we must re-read the subplan and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortReInitializeDSM
+ *
+ *		Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	/* If there's any instrumentation space, clear it for next time */
+	if (node->shared_info != NULL)
+	{
+		memset(node->shared_info->sinfo, 0,
+			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9c68de8565..90c82af17f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  node->randomAccess);
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79823..94d5ba0e41 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -919,6 +919,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -930,13 +948,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(skipCols);
 
 	return newnode;
 }
@@ -4817,6 +4851,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df137e..415a9e9b19 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -870,12 +870,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -897,6 +895,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(skipCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3739,6 +3755,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866b53..9f64d50103 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2060,12 +2060,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2074,6 +2075,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(skipCols);
 
 	READ_DONE();
 }
@@ -2636,6 +2663,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4a22..e96c5fe137 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3613,6 +3613,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8679b14b29..fd0ba203d5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -121,6 +121,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1605,6 +1606,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort may be either a full sort of the relation, or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the source data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided into groups uniformly.
+ * Also, if a LIMIT is specified then we only have to pull from the source and
+ * sort some of the groups.
+ *
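+ * As a rough illustration, assuming the presorted keys split the T input
+ * tuples into G uniform groups, the comparison cost becomes
+ * G * (T/G) * log2(T/G) instead of T * log2(T), and only the first group's
+ * sort is charged to startup cost.
+ *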
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1631,7 +1639,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1647,19 +1657,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
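+
+	/* with incremental sort disabled, cost this as a plain full sort */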
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1685,13 +1704,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the dataset is divided into by the
+	 * presorted keys.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are all equal.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1701,7 +1757,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1712,10 +1768,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1723,14 +1779,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate is 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add per group cost of fetching tuples from input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can return any tuples,
+	 * so its cost goes to startup.  With a LIMIT, only about
+	 * num_groups * (output_tuples / tuples) groups must be sorted in total;
+	 * the cost of the remaining groups goes to run cost.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1741,6 +1816,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to cost the detection of
+		 * sort groups.  This turns out to be one extra copy and comparison
+		 * per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2717,6 +2806,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2743,6 +2834,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ef58cff28d..329ba7b532 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
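+ *    For example, for (a, b, c) and (a, b, d) the result is 2.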
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
+
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -1488,26 +1517,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the given query_pathkeys.
+ * The remaining pathkeys can be satisfied by incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+	if (enable_incrementalsort)
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/*
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any leading common pathkeys are useful for ordering, since the
+		 * remaining keys can be handled by incremental sort.
+		 */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		/*
+		 * When incremental sort is disabled, pathkeys are useful only when
+		 * they contain all the query pathkeys.
+		 */
+		if (n_common_pathkeys == list_length(query_pathkeys))
+			return n_common_pathkeys;
+		else
+			return 0;
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1523,7 +1568,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283d6b..133435f516 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int skipCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int skipCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -437,6 +438,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1122,6 +1124,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1156,9 +1159,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1508,6 +1513,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1537,12 +1543,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+	if (n_common_pathkeys < list_length(pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1655,6 +1665,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1664,7 +1675,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
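+	/* For incremental sort, tell make_sort how many leading keys are presorted */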
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1908,7 +1925,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3848,10 +3866,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3862,10 +3885,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4927,8 +4955,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			skip_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		skip_cols = ((IncrementalSort *) plan)->skipCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, skip_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5519,13 +5552,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		skipCols = 0;
+
+	if (skipCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->skipCols = skipCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5858,9 +5909,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'skipCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5880,7 +5933,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5923,7 +5976,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5944,7 +5997,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5977,7 +6031,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6633,6 +6687,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dadd81..3842271245 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3849,14 +3849,14 @@ create_grouping_paths(PlannerInfo *root,
 			foreach(lc, input_rel->partial_pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
-				bool		is_sorted;
+				int			n_useful_pathkeys;
 
-				is_sorted = pathkeys_contained_in(root->group_pathkeys,
-												  path->pathkeys);
-				if (path == cheapest_partial_path || is_sorted)
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (path == cheapest_partial_path || n_useful_pathkeys > 0)
 				{
 					/* Sort the cheapest partial path, if it isn't already */
-					if (!is_sorted)
+					if (n_useful_pathkeys < list_length(root->group_pathkeys))
 						path = (Path *) create_sort_path(root,
 														 grouped_rel,
 														 path,
@@ -3929,14 +3929,14 @@ create_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -5003,13 +5003,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -6139,8 +6139,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75ad5..eb95ca4c5e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -983,7 +983,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761710..9c6f910f14 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1356,12 +1356,13 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1375,6 +1376,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1622,7 +1625,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1715,6 +1719,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1731,7 +1736,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+	if (n_common_pathkeys == list_length(pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1745,6 +1752,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2604,9 +2613,31 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
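+	/*
+	 * If the input is already sorted by a nonempty prefix of the requested
+	 * pathkeys, build an IncrementalSortPath that skips those leading
+	 * columns; otherwise build a plain SortPath.
+	 */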
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->skipCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2620,7 +2651,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2932,7 +2965,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 79dbfd1a05..e3e984b3da 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -291,7 +291,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
 												   work_mem,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first (i + 1) pathkeys divide the dataset into.  This is really a
+ * convenience wrapper over estimate_num_groups().
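+ *
+ * For example (illustrative), for pathkeys (a, b), result[0] is the estimated
+ * number of distinct values of "a", and result[1] is the estimated number of
+ * distinct (a, b) combinations.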
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be329e..bea4f00421 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -857,6 +857,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index eecc66cafa..0265da312b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -231,6 +231,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								   one group's sort, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it is a value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -573,6 +580,9 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -607,18 +617,27 @@ static Tuplesortstate *
 tuplesort_begin_common(int workMem, bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -636,7 +655,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -654,6 +673,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -694,13 +714,14 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -742,7 +763,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -773,7 +794,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -864,7 +885,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -939,7 +960,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -981,7 +1002,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1092,16 +1113,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1160,7 +1177,98 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	if (spaceUsed > state->maxSpace)
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
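+ *
+ *	A sketch of the intended usage, sorting similar batches one at a time
+ *	(fetch_next_tuple and emit are illustrative helpers, not real APIs):
+ *
+ *		while (fetch_next_tuple(&slot))
+ *			tuplesort_puttupleslot(state, slot);
+ *		tuplesort_performsort(state);
+ *		while (tuplesort_gettupleslot(state, true, false, slot, NULL))
+ *			emit(slot);
+ *		tuplesort_reset(state);		-- ready for the next batch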
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2944,18 +3052,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f7407a1..4180f57e88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1754,6 +1754,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset could already be
+ *	 presorted by some prefix of those keys.  We call these "skip keys".
+ *	 SkipKeyData represents information about one such key.
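+ *	 For example (illustrative), for ORDER BY x, y over input already sorted
+ *	 by x, there is a single skip key, x: consecutive tuples with equal x
+ *	 belong to the same sort group.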
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} SkipKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1782,6 +1796,44 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from the outer node
+								   is finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal skip keys */
+	TupleTableSlot *sampleSlot;	/* slot for sample tuple of sort group */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2eb3d6d371..b6a9d6c597 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5f7b..033ec416fe 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -750,6 +750,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			skipCols;		/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8ed6..0d072fd7c3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1513,6 +1513,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			skipCols;
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d2fff76653..45cfbee724 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
+extern bool enable_incrementalsort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0072b7aa0d..d6b8841d33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -226,6 +227,7 @@ extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
 extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 							  List *mergeclauses,
 							  List *outer_pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5d57c503ab..9a5b7f8d3c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -90,7 +90,8 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, bool randomAccess,
+					 bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem, bool randomAccess);
@@ -134,6 +135,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index c9c8f51e1c..898361d6b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge         | on
  enable_hashagg             | on
  enable_hashjoin            | on
+ enable_incrementalsort     | on
  enable_indexonlyscan       | on
  enable_indexscan           | on
  enable_material            | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan             | on
  enable_sort                | on
  enable_tidscan             | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#52Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#51)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Mon, Jan 8, 2018 at 10:17 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

I have no other questions about this patch. I expect the CFM to set the
status to "ready for committer" as soon as the other reviewers confirm
they're happy about the patch status.

Good, thank you. Let's see what other reviewers will say.

Rebased patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-16.patchapplication/octet-stream; name=incremental-sort-16.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 08b30f83e0..669fc82a75 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1997,28 +1997,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Assuming that the sort is not worth pushing down,
+-- the CROSS JOIN is also not pushed down, in order to transfer fewer tuples
+-- over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is
+-- able to return tuples in the given order without a full sort, using an
+-- index scan plus incremental sort instead.  This is much cheaper than a full
+-- sort on the local side, even though we don't know the LIMIT on the remote
+-- side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 7f4d0dab25..0c55c761e9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -511,7 +511,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Assuming that the sort is not worth pushing down,
+-- the CROSS JOIN is also not pushed down, in order to transfer fewer tuples
+-- over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is
+-- able to return tuples in the given order without a full sort, using an
+-- index scan plus incremental sort instead.  This is much cheaper than a full
+-- sort on the local side, even though we don't know the LIMIT on the remote
+-- side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 00fc364c0a..2596ebe595 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3627,6 +3627,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 900fa74e85..8246a95bfb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1014,6 +1018,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1614,6 +1621,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1939,14 +1952,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			skipCols;
+
+	if (IsA(plan, IncrementalSort))
+		skipCols = ((IncrementalSort *) plan)->skipCols;
+	else
+		skipCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, skipCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->skipCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1957,7 +1993,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1981,7 +2017,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2050,7 +2086,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2107,7 +2143,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2120,13 +2156,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2166,9 +2203,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2376,6 +2417,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
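+			/* cannot scan backwards: earlier sort groups are discarded */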
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..774cfb69d7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,10 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 		case T_SortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortReInitializeDSM((IncrementalSortState *) planstate, pcxt);
+			break;
 
 		default:
 			break;
@@ -976,6 +989,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1241,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..dc9e6d7cf7
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,643 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required keys
+ *		list.  Thus, when it's required to sort by (key1, key2 ... keyN) and
+ *		the input is already sorted by (key1, key2 ... keyM), M < N, we
+ *		individually sort the groups in which the values of
+ *		(key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we are required to
+ *		sort them by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y the
+ *		following groups, which have equal x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		its biggest benefit is for queries with LIMIT, because incremental
+ *		sort can return the first tuples without reading the whole input
+ *		dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for skipKeys comparison.
+ */
+static void
+prepareSkipCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					skipCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	skipCols = plannode->skipCols;
+
+	node->skipKeys = (SkipKeyData *) palloc(skipCols * sizeof(SkipKeyData));
+
+	for (i = 0; i < skipCols; i++)
+	{
+		Oid equalityOp, equalityFunc;
+		SkipKeyData *key;
+
+		key = &node->skipKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether the first "skipCols" sort column values of two tuples are equal.
+ */
+static bool
+cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB, result;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].attno;
+		SkipKeyData *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->skipKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
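/*
 * [Note, not part of the patch: unlike the SQL "=" operator, two NULLs
 *  compare as equal above.  That is intentional: the sort order places all
 *  NULLs of a presorted column adjacently, so they belong to the same
 *  prefix group.]
 */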
+
+/*
+ * Copying tuples into node->sampleSlot introduces some overhead.  It is
+ * especially noticeable when groups contain just one or a few tuples.  To
+ * cope with this problem, we don't start comparing against a sample tuple
+ * until the current group contains at least MIN_GROUP_SIZE tuples.  This
+ * may reduce the efficiency of incremental sort, but it also reduces the
+ * probability of a performance regression.
+ */
+#define MIN_GROUP_SIZE 32
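/*
 * [Worked illustration, not part of the patch: with MIN_GROUP_SIZE = 32 and
 *  an input whose prefix groups each contain a single tuple, the first 32
 *  tuples are simply buffered and sorted as one batch; only from the 33rd
 *  tuple onward does the executor compare against sampleSlot to detect group
 *  boundaries, which bounds the per-tuple copy/compare overhead on such
 *  pathological inputs.]
 */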
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples whose prefix sort columns are equal and
+ *		sorts them using tuplesort.  This approach avoids sorting the whole
+ *		dataset at once.  Besides taking less memory and being faster, it
+ *		allows us to start returning tuples before fetching the full
+ *		dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * If this is the first call, or the previous group is exhausted, read
+	 * the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c.  Further calls just fetch tuples from the tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize support structures for cmpSortSkipCols - already
+		 * sorted columns.
+		 */
+		prepareSkipCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We feed the tuplesort groups
+		 * of at least MIN_GROUP_SIZE tuples, so these groups don't
+		 * necessarily have equal values of the first column.  Groups are
+		 * unlikely to be huge with incremental sort, so using abbreviated
+		 * keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put the saved tuple, if any, into the tuplesort */
+	if (!TupIsNull(node->sampleSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->sampleSlot);
+		ExecClearTuple(node->sampleSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Feed the tuplesort with the next group of tuples, i.e. tuples whose
+	 * skipCols sort values are all equal.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Buffer the first MIN_GROUP_SIZE tuples unconditionally */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->sampleSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Continue while the skip columns match those of the saved tuple */
+			if (cmpSortSkipCols(node, node->sampleSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->sampleSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->sampleSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->skipKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize the return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->sampleSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->sampleSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't retain the whole sorted result, and it never
+	 * selects randomAccess, so we always forget previous sort results,
+	 * re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortReInitializeDSM
+ *
+ *		Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	/* If there's any instrumentation space, clear it for next time */
+	if (node->shared_info != NULL)
+	{
+		memset(node->shared_info->sinfo, 0,
+			   node->shared_info->num_workers * sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
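/*
 * [Hedged summary, not part of the patch, of the control flow implemented
 *  in ExecIncrementalSort above, with names taken from the patch:
 *
 *    if (sort_Done and a tuple is still available)
 *        return it from the tuplesort;
 *    if (first call)
 *        prepareSkipCols(); tuplesort_begin_heap(..., skipAbbrev = true);
 *    else
 *        tuplesort_reset();
 *    feed at least MIN_GROUP_SIZE tuples, then keep feeding while the
 *    skip columns match the sample tuple;
 *    tuplesort_performsort(); sort_Done = true; return the first tuple.]
 */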
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 266a3ef8ef..0c9862da75 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(skipCols);
 
 	return newnode;
 }
@@ -4831,6 +4865,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 011d2a3fa9..116dcc937f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(skipCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3754,6 +3770,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..ddb658b5df 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2080,6 +2081,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(skipCols);
 
 	READ_DONE();
 }
@@ -2647,6 +2674,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a00eb..c546dc8862 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3624,6 +3624,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d8db0b29e1..730e69f313 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the presorted pathkeys
+ * divide the source data into, and then estimate the cost of sorting each
+ * individual group, assuming the data is distributed among groups uniformly.
+ * Also, if a LIMIT is specified, then we only have to pull from the source
+ * and sort some of the groups.
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of leading pathkeys by which the input is
+ *		already sorted
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1694,13 +1713,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples with equal
+	 * presorted keys.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1710,7 +1766,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1721,10 +1777,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1788,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add the per-group cost of fetching tuples from the input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can start returning
+	 * tuples.  Sorting the remaining groups is only needed to return the
+	 * rest of the tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1825,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to account for the
+		 * detection of sort group boundaries.  This turns out to be one
+		 * extra copy and comparison per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
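/*
 * [Hedged sketch, not part of the patch, of the cost shape assembled above.
 *  Names are illustrative; the disk-spill and bounded-sort branches are
 *  omitted.]
 */
#include <math.h>

static double
sketch_incremental_sort_cost(double tuples, double num_groups,
							 double output_tuples, double comparison_cost,
							 double input_run_cost)
{
	double		group_tuples = tuples / num_groups;
	double		group_cost;
	double		rest_cost;

	/* quicksort branch: N * log2(N) comparisons, clamped for tiny groups */
	if (group_tuples >= 2.0)
		group_cost = comparison_cost * group_tuples * log2(group_tuples);
	else
		group_cost = comparison_cost * group_tuples;

	/* each group also pays its share of fetching the input */
	group_cost += input_run_cost / num_groups;

	/*
	 * The first group is the startup cost; later groups are charged only
	 * for the fraction of groups we actually have to produce under a LIMIT.
	 */
	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;

	return group_cost + (rest_cost > 0.0 ? rest_cost : 0.0);
}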
@@ -2727,6 +2816,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2753,6 +2844,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2989,18 +3082,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also Nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..cf980ac590 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,33 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
+
+
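/*
 * [Illustration, not part of the patch: with keys1 = (a, b, c) and
 *  keys2 = (a, b, d), pathkeys_common() returns 2; with disjoint leading
 *  keys it returns 0, in which case an incremental sort cannot help.]
 */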
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -1580,26 +1609,42 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the requested ordering.  The
+ * remaining keys can be satisfied by an incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n_common_pathkeys = pathkeys_common(query_pathkeys, pathkeys);
+
+	if (enable_incrementalsort)
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/*
+		 * Return the number of pathkeys in common, or 0 if there are none.
+		 * Any common prefix of pathkeys is useful for ordering, because the
+		 * remaining keys can be handled by an incremental sort.
+		 */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		/*
+		 * When incremental sort is disabled, pathkeys are useful only when
+		 * they contain all the query pathkeys.
+		 */
+		if (n_common_pathkeys == list_length(query_pathkeys))
+			return n_common_pathkeys;
+		else
+			return 0;
 	}
-
-	return 0;					/* path ordering not useful */
 }
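/*
 * [Illustration, not part of the patch: with query_pathkeys = (a, b, c) and
 *  a path sorted by (a), this now returns 1 when enable_incrementalsort is
 *  on, since the remaining keys (b, c) can be produced by an incremental
 *  sort, and 0 when it is off, preserving the old all-or-nothing behavior.]
 */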
 
 /*
@@ -1615,7 +1660,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..e7529b6c04 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int skipCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int skipCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, best_path->subpath->pathkeys);
+	if (n_common_pathkeys < list_length(pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->skipCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			skip_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		skip_cols = ((IncrementalSort *) plan)->skipCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, skip_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		skipCols = 0;
+
+	if (skipCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->skipCols = skipCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'skipCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, skipCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index de1257d9c2..496024cb16 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4650,13 +4650,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5786,8 +5786,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6023,14 +6024,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6092,21 +6093,24 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, partially_grouped_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			int			n_useful_pathkeys;
 
 			/*
 			 * Insert a Sort node, if required.  But there's no point in
-			 * sorting anything but the cheapest path.
+			 * non-incremental sorting anything but the cheapest path.
 			 */
-			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
-			{
-				if (path != partially_grouped_rel->cheapest_total_path)
-					continue;
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (n_useful_pathkeys == 0 &&
+				path != partially_grouped_rel->cheapest_total_path)
+				continue;
+
+			if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				path = (Path *) create_sort_path(root,
 												 grouped_rel,
 												 path,
 												 root->group_pathkeys,
 												 -1.0);
-			}
 
 			if (parse->hasAggs)
 				add_path(grouped_rel, (Path *)
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index b586f941a8..3bce376e38 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -987,7 +987,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fe3b4582d4..b411a70015 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1362,12 +1362,13 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1382,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1631,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1725,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1742,9 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+
+	if (n_common_pathkeys == list_length(pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1758,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2619,31 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->skipCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
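/*
 * [Illustration, not part of the patch: sorting by (a, b) a subpath already
 *  ordered by (a) yields an IncrementalSortPath with skipCols = 1, costed
 *  with presorted_keys = 1; an unordered subpath yields a plain SortPath.]
 */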
@@ -2626,7 +2657,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2971,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..4726bee850 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3714,6 +3714,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i+1 pathkeys divide the dataset into.  This is a convenience
+ * wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..44a30c2430 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
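/*
 * [Usage note, not part of the patch: the new GUC can be toggled per
 *  session, e.g. "SET enable_incrementalsort = off;", which makes
 *  cost_sort() treat presorted_keys as 0 and make_sort() fall back to a
 *  plain Sort node.]
 */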
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..fb17b4f1c5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -243,6 +243,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								   one group's sort, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the on-disk figure,
+								   false when it's the in-memory figure */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context that survives
+								   tuplesort_reset */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +654,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +692,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +702,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +734,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +759,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -807,14 +827,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +878,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +911,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1006,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1085,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1128,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1245,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing the resources of a tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1311,98 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax 
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	if (spaceUsed > state->maxSpace)
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -3137,18 +3245,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..a9b562843d
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortReInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820f43..bc158677b1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1764,6 +1764,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset could already be
+ *	 presorted by some prefix of those keys.  We call them "skip keys".
+ *	 SkipKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct SkipKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} SkipKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1792,6 +1806,44 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from the
+								   outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal skip keys */
+	TupleTableSlot *sampleSlot;	/* slot for the sample tuple of a sort group */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..e29a312d4a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			skipCols;		/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..9d266888a4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			skipCols;
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..8eaa1bd816 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +230,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4d5931d67e..cec3b22fb5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 4fccd9ae54..e0290977f1 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -935,10 +935,12 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
                                     QUERY PLAN                                    
 ----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: t1.a, t2.b, ((t3.a + t3.b))
+   Presorted Key: t1.a
    ->  Result
-         ->  Append
+         ->  Merge Append
+               Sort Key: t1.a
                ->  Merge Left Join
                      Merge Cond: (t1.a = t2.b)
                      ->  Sort
@@ -987,7 +989,7 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
                      ->  Sort
                            Sort Key: t2_2.b
                            ->  Seq Scan on prt2_p3 t2_2
-(52 rows)
+(54 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
   a  |  c   |  b  |  c   | ?column? | c 
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge        | on
  enable_hashagg            | on
  enable_hashjoin           | on
+ enable_incrementalsort    | on
  enable_indexonlyscan      | on
  enable_indexscan          | on
  enable_material           | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan            | on
  enable_sort               | on
  enable_tidscan            | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#52)
3 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi,

I have started reviewing the patch and doing some testing, and I
pretty quickly ran into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is likely where the bugs will be.

Otherwise the patch seems fairly complete. A couple of minor things that
I noticed while eyeballing the changes in a diff editor.

1) On a couple of places the new code has this comment

/* even when not parallel-aware */

while all the immediately preceding blocks use

/* even when not parallel-aware, for EXPLAIN ANALYZE */

I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.

2) I think the purpose of sampleSlot should be explicitly documented
(and I'm not sure "sample" is a good term here, as it suggests some sort
of sampling; for example, nodeAgg uses grp_firstTuple).

3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.

4) In cmpSortSkipCols, when checking if the columns changed, the patch
does this:

n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;

for (i = 0; i < n; i++)
{
... check i-th key ...
}

My hunch is that checking the keys from the last one, i.e.

for (i = (n-1); i >= 0; i--)
{
....
}

would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.
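
For illustration, the reversed check might look like this (just a
sketch, with the per-key equality test abstracted into a hypothetical
keys_equal() helper):

n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;

/* Compare presorted columns starting from the least significant key,
 * since with "ORDER BY a,b" the trailing key changes most often, so a
 * mismatch should be detected earlier on average. */
for (i = n - 1; i >= 0; i--)
{
    if (!keys_equal(node, a, b, i))    /* hypothetical per-key check */
        return false;
}
return true;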

5) The changes from

if (pathkeys_contained_in(...))

to

n = pathkeys_common(pathkeys, subpath->pathkeys);

if (n == 0)

seem rather inconvenient to me, as it makes the code unnecessarily
verbose. I wonder if there's a better way to deal with this.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

plan.txttext/plain; charset=UTF-8; name=plan.txtDownload
crash.sqlapplication/sql; name=crash.sqlDownload
backtrace.txttext/plain; charset=UTF-8; name=backtrace.txtDownload
#54Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#53)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

Thank you for reviewing this patch!
Revised version is attached.

On Mon, Mar 5, 2018 at 1:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

I have started reviewing the patch and doing some testing, and I
pretty quickly ran into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is likely where the bugs will be.

Right. The incremental sort patch introduces maincontext, a memory context
which persists between incremental sort groups. But mergeruns()
reallocates memtuples in sortcontext, which is cleared by tuplesort_reset().
Fixed in the revised patch.
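
To make the intended lifetimes explicit, the context hierarchy in the
revised patch looks like this (the allocation below is illustrative,
not the exact mergeruns() code):

/*
 * maincontext                -- survives tuplesort_reset()
 *   +-- sortcontext          -- reset by tuplesort_reset()
 *         +-- tuplecontext   -- reset together with sortcontext
 *
 * Anything that must outlive a reset, like the memtuples array that
 * mergeruns() may reallocate, has to be allocated in maincontext:
 */
MemoryContext oldcontext;

oldcontext = MemoryContextSwitchTo(state->maincontext);
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
MemoryContextSwitchTo(oldcontext);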

Otherwise the patch seems fairly complete. A couple of minor things that

I noticed while eyeballing the changes in a diff editor.

1) On a couple of places the new code has this comment

/* even when not parallel-aware */

while all the immediately preceding blocks use

/* even when not parallel-aware, for EXPLAIN ANALYZE */

I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.

Right, fixed. I also found that incremental sort, like regular sort,
shouldn't support DSM reinitialization. Fixed in the revised patch.

2) I think the purpose of sampleSlot should be explicitly documented

(and I'm not sure "sample" is a good term here, as it suggests some sort
of sampling; for example, nodeAgg uses grp_firstTuple).

Yes, "sample" isn't a good term here. However, "first" isn't really
correct either, because we can skip some tuples from the beginning of the
group in order not to form groups too frequently. I'd rather name it the
"pivot" tuple slot.

3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.

Good point, renamed.

4) In cmpSortSkipCols, when checking if the columns changed, the patch

does this:

n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;

for (i = 0; i < n; i++)
{
... check i-th key ...
}

My hunch is that checking the keys from the last one, i.e.

for (i = (n-1); i >= 0; i--)
{
....
}

would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.

Agreed.

5) The changes from

if (pathkeys_contained_in(...))

to

n = pathkeys_common(pathkeys, subpath->pathkeys);

if (n == 0)

seem rather inconvenient to me, as it makes the code unnecessarily
verbose. I wonder if there's a better way to deal with this.

I would rather say that it changes from

if (pathkeys_contained_in(...))

to

n = pathkeys_common(pathkeys, subpath->pathkeys);

if (n == list_length(pathkeys))

I've introduced pathkeys_common_contained_in(), which returns the same
result as pathkeys_contained_in() but also sets the number of common
pathkeys through its last argument. It simplifies the code a little bit.
The name could probably be improved.
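
For reference, such a helper could look roughly like this (a sketch
only, relying on canonical pathkeys being comparable by pointer; see
the attached patch for the actual implementation):

bool
pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
{
    int         n = 0;
    ListCell   *key1;
    ListCell   *key2;

    /* count the length of the common prefix of the two pathkey lists */
    forboth(key1, keys1, key2, keys2)
    {
        if (lfirst(key1) != lfirst(key2))
            break;
        n++;
    }

    *n_common = n;

    /* keys1 is contained in keys2 iff all of keys1 matched */
    return (n == list_length(keys1));
}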

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-17.patchapplication/octet-stream; name=incremental-sort-17.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a2b13846e0..3eab376391 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be processed on the remote side, because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down, the CROSS JOIN
+-- is also not pushed down, in order to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- and incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be processed on the remote side, because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down, the CROSS JOIN
+-- is also not pushed down, in order to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- and incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 259a2d83b4..0bc5690ad1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3627,6 +3627,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 900fa74e85..8366a2212c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1014,6 +1018,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1614,6 +1621,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1939,14 +1952,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			presortedCols;
+
+	if (IsA(plan, IncrementalSort))
+		presortedCols = ((IncrementalSort *) plan)->presortedCols;
+	else
+		presortedCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, presortedCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1957,7 +1993,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1981,7 +2017,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2050,7 +2086,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2107,7 +2143,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2120,13 +2156,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2166,9 +2203,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2376,6 +2417,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyLong("Sort Groups",
+								incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyLong("Sort Space Used", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyLong("Sort Groups", groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..6c597c5b20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -916,6 +925,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -976,6 +986,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1238,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required key
+ *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we separately
+ *		sort each group where the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be as follows.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y each of
+ *		the following groups, which have equal x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		it brings the greatest benefit to queries with LIMIT, because
+ *		incremental sort can return the first tuples without reading the
+ *		whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presortedKeys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presortedKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if the first "presortedCols" sort values are equal.
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	for (i = n - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presortedKeys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presortedKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->grpPivotSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem, we don't copy the pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the
+ * efficiency of incremental sort, but it reduces the probability of
+ * regression.
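+ *
+ * For example, if every group contains just one tuple, we still feed the
+ * tuplesort batches of MIN_GROUP_SIZE tuples sorted by all columns, rather
+ * than paying for a tuplesort reset and a pivot-tuple copy per tuple.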
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it allows the node to start returning tuples before the
+ *		full dataset has been fetched from the outer subtree.
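+ *
+ *		Per-call flow, as a sketch:
+ *		1. If a sorted group is in progress, return its next tuple.
+ *		2. Otherwise, reset (or create) the tuplesort, accumulate tuples
+ *		   until the presorted-column prefix changes (after at least
+ *		   MIN_GROUP_SIZE tuples), sort them, and return the first tuple.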
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the sorted set, if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * If this is the first time through or the previous group is exhausted,
+	 * read the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c.  Subsequent calls just fetch tuples from the tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for cmpSortPresortedCols,
+		 * which compares the already-sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to the tuplesort, so these groups don't
+		 * necessarily have equal values of the first column.  We are
+		 * unlikely to see huge groups with incremental sort; therefore,
+		 * using abbreviated keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/*
+	 * Calculate the remaining bound for bounded sort: tuples already
+	 * returned from previous groups count against the overall bound.
+	 */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put the saved pivot tuple, if any, into the tuplesort */
+	if (!TupIsNull(node->grpPivotSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+		ExecClearTuple(node->grpPivotSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, whose presortedCols sort values are
+	 * equal, into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Below MIN_GROUP_SIZE, put tuples into the tuplesort unconditionally */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save the last tuple of the minimal group as the pivot */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->grpPivotSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Accumulate tuples while the presorted cols match the pivot tuple */
+			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->grpPivotSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done by the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->grpPivotSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->presortedKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support randomAccess and holds only the
+	 * current group in the tuplesortstate, so we always forget previous
+	 * sort results and have to re-read the subplan and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 266a3ef8ef..a17a24b62b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4831,6 +4865,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 011d2a3fa9..6666dd0a82 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3754,6 +3770,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..c50365c56a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2080,6 +2081,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2647,6 +2674,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a00eb..c546dc8862 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3624,6 +3624,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d8db0b29e1..730e69f313 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the source data is
+ * divided into by the presorted pathkeys, and then estimate the cost of
+ * sorting each individual group, assuming the data is divided into groups
+ * uniformly.  Also, if a LIMIT is specified, we have to pull from the
+ * source and sort only some of the groups.
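+ *
+ * As a rough illustration, with T input tuples uniformly divided into G
+ * groups, the per-group quicksort cost is about
+ * comparison_cost * (T/G) * log2(T/G), and the startup cost covers only
+ * the first group, rather than comparison_cost * T * log2(T) for sorting
+ * the whole input at once.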
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1694,13 +1713,50 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the dataset is divided into by the
+	 * presorted keys.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1710,7 +1766,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1721,10 +1777,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1788,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add per group cost of fetching tuples from input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can return any
+	 * output; sorting the rest of the groups is required to return all the
+	 * other tuples.  With a LIMIT, only about
+	 * num_groups * (output_tuples / tuples) groups need to be sorted at
+	 * all, hence the scaling below.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1825,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to account for the
+		 * cost of detecting sort groups.  This turns out to be one extra
+		 * copy and comparison per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2727,6 +2816,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2753,6 +2844,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2989,18 +3082,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also Nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..869c7c0b16 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,7 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -327,6 +330,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
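+ *
+ *    For example, for keys1 = (a, b, c) and keys2 = (a, b, d) this sets
+ *    *n_common to 2 and returns false; for keys1 = (a, b) and
+ *    keys2 = (a, b, c) it sets *n_common to 2 and returns true.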
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1628,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the given
+ * query_pathkeys.  The remainder can be satisfied by an incremental sort.
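+ *
+ * For example, with query_pathkeys (a, b, c) and pathkeys (a, b), this
+ * returns 2 when incremental sort is enabled (the remaining key can be
+ * satisfied incrementally) and 0 otherwise.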
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of path keys in common, or 0 if there are none.
+			 * Any leading common pathkeys could be useful for ordering because
+			 * we can use the incremental sort.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			/*
+			 * When incremental sort is disabled, pathkeys are useful only when
+			 * they do contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1682,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..30b91bd5bc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int presortedCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										  &n_common_pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+									  &n_common_pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int presortedCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index de1257d9c2..496024cb16 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4650,13 +4650,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5786,8 +5786,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6023,14 +6024,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6092,21 +6093,24 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, partially_grouped_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			int			n_useful_pathkeys;
 
 			/*
 			 * Insert a Sort node, if required.  But there's no point in
-			 * sorting anything but the cheapest path.
+			 * non-incrementally sorting anything but the cheapest path.
 			 */
-			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
-			{
-				if (path != partially_grouped_rel->cheapest_total_path)
-					continue;
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (n_useful_pathkeys == 0 &&
+				path != partially_grouped_rel->cheapest_total_path)
+				continue;
+
+			if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				path = (Path *) create_sort_path(root,
 												 grouped_rel,
 												 path,
 												 root->group_pathkeys,
 												 -1.0);
-			}
 
 			if (parse->hasAggs)
 				add_path(grouped_rel, (Path *)
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 46367cba63..616ad1a474 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2782,6 +2782,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index b586f941a8..3bce376e38 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -987,7 +987,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fe3b4582d4..aa154b8905 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys;
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										 &n_common_pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	/*
+	 * Use incremental sort when it's enabled and there are common pathkeys,
+	 * use regular sort otherwise.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf240aa9c5..b694a5828d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3716,6 +3716,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+ * 							  is divided into by the pathkeys.
+ *
+ * Returns an array of group counts: with zero-based indexing, the i'th
+ * element is the number of groups the first i+1 pathkeys divide the dataset
+ * into.  This is effectively a convenience wrapper over estimate_num_groups().
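+ *
+ * For example, for pathkeys (a, b), result[0] is the estimated number of
+ * distinct values of a, and result[1] is the estimated number of distinct
+ * (a, b) combinations.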
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..44a30c2430 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..26263ab5e6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for
+								   on-disk space, false when it's the value
+								   for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for data surviving
+								   tuplesort_reset */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +657,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +705,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This context holds data
+	 * that is worth keeping while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,110 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* On-disk usage trumps in-memory; within the same kind, keep the max. */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This avoids recreating the tuplesort (and saves resources)
+ *	when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2589,8 +2710,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2640,7 +2760,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3137,18 +3258,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820f43..fb1e336b9d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1764,6 +1764,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1792,6 +1806,45 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* have we finished fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	PresortedKeyData *presortedKeys;	/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *grpPivotSlot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..13d9a75b50 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..5b0c63add9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..597c5052a9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index a79f891da7..0926650a0f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4d5931d67e..cec3b22fb5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 4fccd9ae54..e0290977f1 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -935,10 +935,12 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
                                     QUERY PLAN                                    
 ----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: t1.a, t2.b, ((t3.a + t3.b))
+   Presorted Key: t1.a
    ->  Result
-         ->  Append
+         ->  Merge Append
+               Sort Key: t1.a
                ->  Merge Left Join
                      Merge Cond: (t1.a = t2.b)
                      ->  Sort
@@ -987,7 +989,7 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
                      ->  Sort
                            Sort Key: t2_2.b
                            ->  Seq Scan on prt2_p3 t2_2
-(52 rows)
+(54 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
   a  |  c   |  b  |  c   | ?column? | c 
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge        | on
  enable_hashagg            | on
  enable_hashjoin           | on
+ enable_incrementalsort    | on
  enable_indexonlyscan      | on
  enable_indexscan          | on
  enable_material           | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan            | on
  enable_sort               | on
  enable_tidscan            | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 2e42ae115d..7229997144 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check that constraint exclusion works correctly with partitions using
#55Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#54)
2 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On 03/05/2018 11:07 PM, Alexander Korotkov wrote:

Hi!

Thank you for reviewing this patch!
Revised version is attached.

OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.

Regarding the benchmarks, I generally used queries of the form

SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b

with the first sort done in various ways:

* regular Sort node
* indexes with Index Scan
* indexes with Index Only Scan

and all these three options with and without LIMIT (the limit was set to
1% of the source table).
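
For illustration, the shape of those queries looks roughly like this (a
sketch, assuming a table t(a,b) with an index on (a); not the literal
benchmark scripts):

CREATE TABLE t (a int, b int);
CREATE INDEX t_a_idx ON t (a);
INSERT INTO t SELECT i % 100, i FROM generate_series(1, 10000) i;
ANALYZE t;

-- the index scan on (a) provides the presorted prefix, so the outer
-- ORDER BY a, b is a candidate for an Incremental Sort on top of it
EXPLAIN (COSTS OFF)
SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a, b;

-- bounded variant, with the limit set to ~1% of the table
EXPLAIN (COSTS OFF)
SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a, b LIMIT 100;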

I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).

All of this may seem like overkill, but I've found a couple of
regressions thanks to it.

The full scripts and results are available here:

https://github.com/tvondra/incremental-sort-tests

The queries actually executed are a bit more complicated, to eliminate
the overhead of transferring data to the client etc. The same approach
was used in the other sorting benchmarks we've done in the past.
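
In case it's not obvious what I mean, one way to do that (a sketch, not
the exact form used in the scripts) is to discard the result set on the
server side, so the timing reflects the sort rather than the transfer:

-- the huge OFFSET forces the whole sort to run, returning no rows
SELECT * FROM
  (SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a, b) bar
OFFSET 100000000;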

I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.

Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.

In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).

There are a couple of regressions in two cases, sort-indexes and
sort-indexes-ios.

On the small dataset this seems to be related to the number of groups
(essentially, the number of distinct values in a column). My assumption
is that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort).
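
That case is easy to reproduce, by the way (a sketch, with a made-up
table): a tiny table where the leading column is nearly unique, so the
incremental sort starts a new group for almost every row:

CREATE TABLE tiny (a int, b int);
CREATE INDEX tiny_a_idx ON tiny (a);
INSERT INTO tiny SELECT i, (random() * 100)::int
  FROM generate_series(1, 10000) i;
ANALYZE tiny;

-- ~10k single-row groups, so the per-group bookkeeping dominates
EXPLAIN ANALYZE SELECT * FROM tiny ORDER BY a, b;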

On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).

In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.

Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.
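
One candidate that comes to mind (again just a sketch, with a made-up
table) is a skewed distribution: many single-row groups followed by one
group holding half the table, so the small groups pay the per-group
setup cost while the big group still needs a full-size sort anyway:

CREATE TABLE skewed (a int, b int);
CREATE INDEX skewed_a_idx ON skewed (a);
INSERT INTO skewed
  SELECT CASE WHEN i < 5000 THEN i ELSE 5000 END,
         (random() * 1000)::int
  FROM generate_series(1, 10000) i;
ANALYZE skewed;

EXPLAIN ANALYZE SELECT * FROM skewed ORDER BY a, b;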

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

sort-10000.tgzapplication/x-compressed-tar; name=sort-10000.tgzDownload
	+��	��h
hwES�:<��q��c���{�{�}8�*m-8,+a�n}Q���%w���� A�%���'!�>y��A:bt1_�3U����?�
����V��Y��/���RY&�~�u�L8��f�e�)�:���kZ~J���5N����B�>:����t���n�-B�������;vi@�����]X���D�����W	�z�02�z�t�
�3�&	�yf�x����-9N�0��
�|\��Y��I���:���{[�+!H��m����j��/%��&]�*����iKsE�����<zT�P�-�P�-���kq��[�K�'�
ld�F�R�#�����B������|Ed�.�=��2)������� �
�&���^�1J����qK�?�B�'����3j�'�����b��"��f0J[���"��0?y�%�r:*���S�Q	�lQ��y��Mot5k�����my�0�)Z���V�&&����p���=�?��$Xb��vX��u�.��]+�j�C\->���A�oG�����<�f3�������Sm?pj�����{�e�P����X|T�G#�'(n�=����nI���r���S�"	G��I�{H��h����>�����+$O��wB]����a��3���wx�l���2��r[6g�L'E���>b�u$����B�1!��yFW������:��/��\��[�/��:Xw�^��7�\��
s��+��@!,��1�g����MD)R0�q���r���
X�����F��j���6��Z�-�$��N���!��X�=,t~b�X����M7�F�(�o�nn��T��������\�$c��X��I�D`Z�xW��A���)��Q�}��cv�C�m����M�k�Wc��	����W5�'H{V���B%N&�6O�yQ����@nMhr���r�w�%GW�G&]'�c]3�p���?��{3k�8�%��:OF�������5�O^=�.#KH��
��5g}�0�=��>��C���|�o�t�p�i��D/�EV�T���M�eX����_��?%�6�~�p�X"<`8�d�7��M���_��
���
Y
����������6���&B����.�O�c|���L���*�M~]A}��pGo��tqp�����{v��U�����
��h^NO��:)��_"���l�r>l�K+y����ID�#
Y+.%[O|�&�q��T�����,�>�qWC�
Y.�_.~�#�[
EQ���;k���_6E�l�>�^��'S�<r$��y
:���`���P9���Q�4B:U�x3{�k������C�qG�����f��O+W�����SO����RB��Sw���<�8h�]�eX�k������������8������3��5��7�����A�_������c�������%g�
�3����Z�O6�al��3?�*�}S�;��������Z��-f�o��&;�� Oes���Cr%���������f����A-�����`U�s8�?�������z ��e�������*K�f�MwP�����@���>G#�9�6�8�������np��#s'~s'O��np�D,7��l��-$��f��7>B���T��Oa��u���8�!=�(PH�.
N���������6�
�6�F���s0tI����1�p9*F%�DN!�+����-�#A�����T�������~v�I�H�5r=,�[�n��]\ �A�}P�V��SC �0i�a�;�����~�d���A�Tt`������"���].���<�9��
����gv���l�8��_�0#r?���qxv��|���>�x�������z�cH[a�/p�;�R�����flMC��
"�-v�������f�����cE��G�[v���W��7��
l���2����
��.^7������+��B����6='��t.-D�i/�T���������zSQ�3c�{�H_9���S���bPS��%�.=�(��|Z��u����O����_W�)�m0?�J)s��o;��)��sk<�����[��u��H�����;����������]ww�n���l;�9�N��W��^���=������j�2_�T�g<	k���:
mZ��3Q?�G�;`���(.�wb5{�h�
�_ikX��9����_8���y~��u��}��#[k�?��}�b����F�R������e/��l?[��v�� J�xq����"wt��?
_�6"������\�D���3T���x��P��*��i��_1<��|Ap��N�O�}~71L���"?�+����|��UD��C�4�����
��!XD+����f~��+��1���_Q����o�[��Q�a���y�W��7�L�g)!�m���a�?��s�T0�P�i�_����i	se�/W�\��4�W��M����F�-�������zm�����.^��H��>Y�L��^�0+�X�8����D)sjG�`C+G�����o�Cf�/v����y�q����E�y��s
@�z0j�E�#��m�!��e�\) � �c<��dH�>xy�l��X<b��T�Ib��"�@	@�.h��K��k?F��6���R��
�:����~��@���B�����,�>>��>/��R;(�[���m�R�m���������G�(��jx�PQ;zp�s������1ioK*x��9�/g���[�`�f������y���=���("���U]df�B�
�]Q��9_QIAs=_
��3U�0�������������I�7���f���2��a����	!	����!��f�I��/�k��u�C��	�S�
��uI��b� �'�]�;�V������ ����U�����/�G�u����`��9���z��X:^�R�h*���I��)���#@�{yS����nZ�q�\��y�d�I��
�e�G�:�@�R�w�%������#�B�{BW4_�ej��������vZ��4<�~�_�?@��6,��A��o�
�l^������"�/�+���o`l)�/V9dD."�����\��c�Dp�E�?�X����Gf��e�������$t�_~5�Y�J����i��{��)I�@���g�yN{H�$p����7�{`S-������,@��������:
x�����������5$ux�-�{�����l�
�w[<@�q�P�N��A���u�8�	'|,;_t">����y�_�*��28��||X��N�C��?�:U�������u'��1�<��1i~��R\-��R����(�"a�����������gB7�]!�����g�S�����P	3�����g8ya� U�i3T����w�"�������[�}��nh�L����4�C�,
P���N>�a�x����j�3�Sz�XD�n[��_L&�j��=C�OA�,�bB���G���L<�H-:�q�{8�������;_�4WauSQ*i�9��#�3�������l������������2e�0:��L���y�m�XblG������ui����^jq@�|���>���]�86u��r��8$Z�!��b����Od���>��X�����8�xm%��7���Q-������A%����n�#W�/Tn6�����Y���X�v��5�YI����.��������wm���t<����xE���P��������g�S(5}&��(��n[�z�%a���=c��V���������2���������/�����@�����Pr��L��,����� ��~�WB����6?�e��|[��mQW�	�K�)�k<���n|�������N�[��FD~Z�&;N�oGV|ph<{Hr��>��M���p�h��>��XT�f�����c�CW�>����5�Ko/h������uS7����e}�@F�X�s��smj=��&%SY-�v��4?����������Fh�5�sa�����8J���B�?L�tA��)�~Zt���.{j"dF6F6�dP
������W!�R��$�J�,�\m�jH5O�����wW�lB,.���"���,��nR����+3~h��O��&�
Q�������c�"$�"X����#M���*x�I|��J���ec����?>�)VL[��C9,zY[[�\�;���bCT0u�_�x����1���/B�~���\~A��G�1����([!��}2J��av��u����O�|�S6GF`��%/HV���h�2�p��7��5�\w�67	��E�+��>Mo����G������P���F�v~%�}uk�6�y�6A;�^���?�S��f��M�1X��1������#i/���-"e\�3�w�����
�hZ�{@#����Zt|7�"������,W�X��w>����.^���bs�F<�
wxG<��f������T_>�����9%��h�\cH�yL`��(�R���m�6�@GC��M�%$�<u��b�=�_-CP
�BxR��!��K�g�	��ixc��8{>y�AQ�?@.u�����=yA��6�eWR���"��r�"=r�qW|���#�a��k
����'��T���f�cG��<�,I7|�`���tx@��E�����q�����S}�N)��[���`����dpG��
���`&��r�PQ������� �+�"�9z�����%��r��\?����Ds�R�3q��
�X����f�o�~���f�e)6MA+3�Z49.���d�CA��:�_�x��x(��t�����=��	�f�LF�����	�q@�KwVk�t�F�4Q�hsx�=�se�(�3��s�(��1��D�w\������NY���3��(�]��H��xv�Gb�xF���!�1Q��2��G�����Ml����%C�O0�e�b`!%@�N�#�Sog�o��ja�'����R��=�3�w�>(j|xSA������vn	�X|v'<j�|���U�����T��"�+�z$���7�fwN��q������ciO�&;	�tz)s+���c`k{�1���clkw�
X�(�|��KS�!O���������c�z<�^�����:>lb���|�%��:��<���HKkyiq4��%������c�m����Zhc�@} `a����7$\�c����I>�)�o����W��>��r^!v��RRI�P���}^u_'�$�v��g7� hq�@����AA�jr��XB3w =Jr����'E�.�}�"ceE�|�<4L�AG�L�I^����n���
S/�{+,=H>�{�1�L%R��z@$����#8�>������h���|@��p��c[��0s +�����O�!MI6�������J���Z�K���
7�
�r+2`o�rU��r^
�T�A$�:e2N�xg0!<�x��a�+��3�q|�����+gI�����v1rw���/��)M��M:�������+�\����V�������	0�x8:fEu0�"�����TG�E�����-������U�D_���'z���5W}�G )�WX�A�����6����u����'qxd���hTd3?r�|=-�$��]��n8��G�Qi��g��[�97������`���Q�{l�n�A�y��9�
���A�����/�P&��X�}#�\���Q�$+)o�������I�N����I�����e5C��6#�������+�������z��|�� {!D�nX2(�8I_�K;C��3���������G����k-��$���n�S���8��|�z.���m�����)K�!;����58�9'��%�B�o��	�}3�+X�Iq����"*j.�87�9���_�1�{����!����?;uv�`�h�����E��s���z�M� 2����Q���e����w����~��3X���u131���|�*���:��.��]*.[�G������=�����E���&M��eE���'���.;]�>d�uz��vnS�����Z����r���T!��`5�./���d��FD��������:�t��#���D3��
�4"4}&�U��M���D�����:&pd��N���zx��|�����gGs���.pH�����/
����cU)y�a�4�)�Z�y� X�1�����C4�	��s&����q����(r�,r��H#�� I��c�.g���Q�������G

�D�a��'��^I�-{��G6�_�x�gG�g7����.��)�	�@����	�v�p�^r2��
�8���t��#3����lkE��&��Cc�8=����2�{��X��O���a��]�Xk��Is�D�V������_�}�����YG!Z����_����2��:��^����u�^3s�;���q���:9��e��.G��*`�c
)5Y���N�Uo��� !e�T���4��T@{�Q��b�������@����Pf�xZ�a��E���Zi�;��M�g|6�E�h&������Mt4�������=����4�����@~������m�����'���������7(�6�?.���?�U6�,{.�B��]��D(���bxo�k�tN�_FL�n���\�����6J���������'����}�������1<k�����O���}���Bo�9������mq���fZ�M4�x������oMtV%����3�gM�8���g�Z�?��?��U����<#0���r��:��ma��
�~���
[����r�J����J	&����o��#�8��y#`�a$
�Lj�����������*�"�ZCr����J��
��C�d�>g�`HD�/
��*�_~��H�6K����t�^�����<��XyG�2��2^b�v���z�l�l����Ej����4!��i�|�CR�<��,���#p5YY"�5�u��z��drN�����W�j3��q�vQ�����LXUvS��l\H����;��	���om:��L
(z�������Z}�3"i�1&_v� `�B��g�		��^_�XHn1���GNT<����tQ
I�J�N:��?�qHj��b��8���v�;J�d^0���6�����`o�������5G��>U&i���qK����p�xE�i�P�	jW�)`�"�4n}�S�1�����b~>�GS���V��V��F��%Q�F;)��X�����^2c��2:���h���}+�7��������+��)�7`�*����kx���	�B�I��^�(�S�n����-2l���g�^����D[]zk�H��4��x�3��zW��n�j�
��i�rZ6�(�������t84t]>�H��f���.��c�YB�!?��B���hdF���T��?:#h�{>����(AD�je��Yu,��a,�(8Q��~�[����x@���v�eL|�M+C+��q��R�����3w�|X<W9m�Fr��H��_*N�]��@�N��4�������,�]y��A]�o�yD�MhLx(��*V�l���"4~7t��	%�1�5�MV�<���p���h��,�|~��]
�����E�������4���m������/����iOa�7-�l;�����}�V\2ud<����x���&a1��YH_�&0��(��IIe�)l�RQVp���@#g �C�c��Hb,B�g�U�1���^�Q��(Z0@����d
+��/�+M3���[�D5��0
U���'O3����$V�xeT3jN!��#��St?9����v[�TY���~�K��(L��;"������K�*D�V�����'C��j�5�,���s�I�|,�zkb}��cri�%W���K�9Tx�?��-��
�7A��t��3�	�����~K�9� �)���,X'E%�`���B���7�)c�2��0�oY2[��r����s`�T��x�)t�
��0A�`�I�h)���6IP���md��*�Q(f�En^q������������_�0����`�������0���%A*��?�F�e��D�m$��������%�#�
C�7���������C��c]P�-]M��C���$D��E��2�����L����2?*`�
����n$��4��N���NR��}���E�E��uUw�����^�BP���*L�����*]�8~�~/|V�p:��bY��X^�UE�<��8s�g� %�,&B�`����W��tu��O��,s���V�Q��$2�$��3���mi���}������_q���7���S�&�I�����UD��Y�l��G�|#�j�=�V�+^� �lB4BA�5'���
���^C�o���pj���P����u��/Dw����9�R�^���4yJKaV�Lb�*L;�35���R��p����~�����
�����Yd]"�{��.��C-'1W��=��Xa�C�>4�u�9�����z�9��[�c��Q���.��v8��?����E����������&�22��E���z�;cZ�6SN$�=�*��(��kJ�9W�I45F]�5�0�����D�f����	{��[T��XuU!�1�~�uU�R��1��=�	�49�x�k�=_��}4����Nl���C�<v�q*�����s2-�0i�J�V�7b-��������n4�R�!'.�-�����V���NPQ�KR�YTZ/���17C�~�O=~�����V~r�@#?������t�
�����*��������i:)���d�������IF�>g1j�9Um��*.��O����%��3iVuW��T�r�H������0�$a�����|��o��;V���.�p�a@%����A<�����:0u����"�TK��\����k/�G# ���2an3���L���T�e����k�Y�&,,�V�E``gYV)����L�U�
����l�-+�Uz�g�+m�G#�u���T�J��|��T'�y���x�y|�v��z,��~����S��Q9w%P�*����N�ym���`�kK���.oV8f��J�G,_�tj�_�!r�4 =&��c�@�M��Xe�{��t�������P�)fn��*7��V�P�V��K	n|�3�
�Q-��8��|�K&/��������v��d�ma�1J�)#5��������������SrmJr�7�:�����&��UR����r�v�+�(���w%��i���J��V	(�#�����a��xHS���Zg�
w�:�E�U_�tb�;2��q���'�F�0pNa$�	�\.�I�F4?.l�W��y����g��z��X�&|uA��E��<[�\��
R�c����>���-��-����5	�@�,Z���p���gU��9r�;,��v�t?�lR4M���>������r�����I�
�$FM�o�G��
�Q��&�	v�>��xR������/�N;ea*��JP�e�,�N�F�r�����2��Fe���N
*���G�u ��'Pp�����cG4X������@������`�����pPWmMa(0H�x�BVZ��YM6��2F-�'�c����Js��5�,�V"H��jG�jV�V�R�����xVv���N���Y�
��l$#�E���XS-*n�V��)od�xEo�2��N�G�z�)���ONo���#���9NV�XT^�����x�FG,;Z���y��/���^�"�W�RP�zi��#��W�+4n"���� 
M<����[������_C���R-�6Wow�z�e�8���l���X�CV9:?�m�G8>��5E�#vFj9�D������,�C���h.j~
D��'��t�F��	�>}w�A�(�Ii��������:�����+��QU����j�n��;������Wg�j=��1h��]�j��p�G������+���(Q�(bc��L�2J��(k^3r��F���5���
��z�j?��l� *���F�w��dt�0�g$��FIWE�"�N��]C���L�����+�#�����HTl���){e�/���|j�)�	�m,C�\�G������)E� vv���`t�r�L�E�4}��V��5�Y���?�].�:+�@S�;i�o[ :�23:�k@����"��W1D
Z���V��a���#���'v��ZM�m�/����C�������?�,��U-s���Z>B�E^���^B30�����U$��uXy\n���L����.��P���L�%%�Kr�(L��rYQ+�#�?�2d;��bpT�
Vb�����~���P�`fX�_��>�>2o��p��k�M�� z�K�����jU���;]>���8��v,����s�%G�'��P�,�vBS�]P�,@���|j�o�����z���%+�*�N��u<
_o�0���'A��IU���B ��g7@.�%"E�:|l�:t>Z��&`���:l�j�|�����@pQ�8Z�ZA��[�z���BcSWb'rf8�<�^Q� ���UL���~��[��bk��Vh���8M�����UL�����z��%b����E��� �G��������Jx������a!�RG����rE���>��_.���k���'�km�@.(+F�e=���j&;%�{��j?�rn�4Rk"dO!�>E�p�������(�A��f�t��-7���)���#��)7V{�}Km�-�H��x�@���B�8�\���j?8&���h}�������0%zg��w�%M��yI��V���W�����>|���4P����"�q���ie
��O���Q�5��]�pW�g(����{Y!�~M�k�{���� >v��������_��g$�9����%��OhU����j:-��Q��eC<��b�G'��G�~����_���:�:���^�7��Sd}T0S�Q���7���,4b����aRJ�����$tp�R?t�Mv����v1�.�(�����C:8bs���t�$7�Ux���>Ic�����|j �T��-:-��#�
w�2poT��t��m:,	�VW����n���"*��:v$��!�dp�.�����'="Z��6�n� �y��]l-p�~���yf~�����O���c�zI�������ve�OZpim
�]����T%sM&�������'R&���O�.i�D�uT1�?��=!Z�S8�Z�b���Z��(�U�� L%1���57�n�~
�����H�I�*�%���0�c.)��w
�Y��9Me1�Ic�x�"q}���rH�jJ��i�+�GiY��nsW�E)��k����a���f�����f��!
E�"����p�Zw[�g�0wdh����M^���YWg�����9������O��<���f�B��Vj����*n�0n�W���}+9�,��e���x�������:FC�{N=-��rX%��I/�TX����oL�&���.ZP��3f��*[��M�9�P���t��~�SC3[���F(�_&}	��F���)�$�,B�[����/��U��"u�r<
��������xT����x9��r['���;��V,/jWCK��#$�EjA}Z"���-���7jr���������*�o�:������T�:z�����o�K���D���F��T�K��(^.�����@hDc{8����x�p-"M�G�"��9�0_�F4����v�n�x�+k��<�WV����:���u���>������/|��-����-��M��05��X�j
�*���Q�&�qv��
���+�����]�c���E���
�o7KzM:�~NiGX����e�H� �5���}�L�~D��i���@������N+�{�`�o��*jw:��iJ�QX9YN��cN���6�8���tQ��oW��ZJf����~Yu��qqG����te^��t��� �2B��8���o��BTH�JJ�s�Z�����d�LE`��C-�y_�� ��_f���������S/��<��K��i�"S[�I.6�9���Q���0��YDX*��J��3�S�NP���b��^��� ��gR����|�-��A�@�N�V�>�7���6�o)�U��q`�����J;A��`aFMl�	Q�"��^H,�peq$�|��h5�,������i{6���F,=�
n&�PY:f�PQN��� ��4�w���N�I�3�S�0�s�(��	h�����Z'��|2�o����v��^C�gsH��%���}e5�=�@����
T��l~�A� ���&���Qy=.gn�*��	8���|�2���r���38��3n���J�?�����;�zl�B
�Zx�y`�qm8����kq�>��Vj�ty�2��DM|��SJ�����J#��NOuPe
E,-4vM�T���=���.�F!�nN�T���P���y����?v�
cx,LJ1F�s�	H{	|X>n4�����"�Y����0p�,�0hd�-Wj���#���>�~Rb�~<�������8���uG���bm�Q84{���e��0�z8fvE�;u���9���6c�;#��������`�%:��>��|��V9,]I?�U�ns!s�����vi[8;�0[*���rNs^�����/a�k$w����Y�����Y��;j�4���"eyk�;#yu��HajCt���/���I��.�T�4rk�I�E�r��\��C�;Vk:��^{���	|������7�F�o;�=�]s�
�5F$1e:|yV�<|? �C
��b^�!2
�c��eI8I���F2T@%,�{���$� P(tk��G>��p�Mm
�P��;A��\:�l���6�P#���xi�u<C���]	4t����y
���|h�U�U�4��P�)�C�6�X���hT3��O��YO-�<������_t�CY��<�!�$�����d�`��$,��]�	��d�X��������?C$<�~�����0�Y�Fp�1I�V��o�W���S[��o���it����c�C~���6
4v�>����@Z,m0J���vG-dN)c#h�MW��{cKCz�����]�^�mt���E�MaIQ^��&	']�x����/b�}�q��'#?	;��<���i�Z�������Q��[�v�b��w�B|�"��Oc��{���@e��
�}1���u������V���o�t]�ZLdnh6���	P��ho� ��nJu4����L�o��l2t�&�H�A����ti�p���o��5�??�`;��[5���K����/�[bg2���
C������-��t���<xD�c�2�m$=�@�j����:�&%c
!��_�����\=����Z��Z�|�B�Y/�����67v�N����n�q����9���Cj|����7������n3Rf�$��.s�-\g"���]�\���}�Ojz��SP���h��t�[]�C���u�O�-�x���}��\�<�&��py dl���~��d��@��������$�@����32����a��R�B�gK�r[���q�y@�}��P��u���)k�������#%�C��n�������������������L:�[��riA��k�����r�������������a���>����[Nx)��Zp8�:'xZ�!���XXl�N�{y��������r�s����|��.��q����B��&�����V��������K������/w�F���{�3O`�=<n����/���[_��/^��7�?`�����8$b������}����o�6>�(�x�����Y�I�/=��K&	xQ�/c��L'4�Rr�[>�L��D1�?�}���}�N�P�����gyA����G��0��P��_�m$d�������'��b`|/1�����q���m���PI1��8i��x��������'������	<n3K��'w�w��o3����n�O����m�0�0�^<9^x��96K��������^�xS4��]�(�����]�5lO<����C1Cu[ey��.��K��7k~3�U��pK�����[��0��\� ��Ge�p�&��2�M����Y���pL�\+�S�� R�0��V,}�RUE��Z��/��� #���\�������������'�l��Ls�����?oO�|s����9�Z�T��qp�d�#wzI�8�'�-��=z���}�� ��`��W2����)�������	n��)���g����\�����p���a
���r��)J>)y���-0��cj�UE�.~��"tXCf�t��4!Wy�DJ�	��Z8U*�"9r�4��h'2��R�|^GNx��	9��/����K1��Qz�P��SW*�-H�Q`�����%�����TB+�%W���+��TR<r|T`��6�B-�0��6v�N����e�����"�p�D�e!P�(�RE(8��R�K�"-R�l�79�W�,*�����D;E��!SK��dP��.��n2���z~�(��Z��h)��d:R��0SM�Va�"0��:�����k��%K#��Hh@�$\�4�$
�?i$%}�h`Wr�8:��O!7�31�@	���@Q�������.2bY��l�0O6�h�E��@��
�^�5�.a������I�{H��W��@(���/A��&��Z�o�H��� ~E�J���B�p���D\���bz�h87����`6��4��E�i�HaB�(Jpl
-�T���5~W`;�L*~��&���#�A��zWL*S��f����K*���M�c��>�T
�K��a��S�g�\@K�w+�W�	�J&�?�Fp"T	�-�p��4����x�Fn�)**4��xmO�xD%LQ��Y)/��p`�o�����Y���y�^��Y)�VZ�
��%+0�YJ�����=%MS�,��Hk�X��|�R"�Z.���g���)��
�W!��l��|b�?z�X U���� }Rn�����AJ���:��K����4g.����)yd}�!�����>+z�|+��Y��pG�V��$������r <�9[�t����'O�m�l�0=�@����Q|�ji4>���>*�Y��3u��Y��m�3Uk��c����$��x���B{L������g�|R@�[�<��
o'�q��SL5l��f��V�]s��?�d�t������,1�,/�4�_����V��{�z�&L�[��8�w@ Kjl�<�
���(��c��G�ZS����Wr��h ��j���$��p%5�E\ 1��%�7W��c�QT,[S{L��`��.�4��/4]��d?>�v�P*��#J��}44������.7O�/��Rz������}�</��N+Ze�]�R[��)o�����{��C���^��W�����*>������dq��np�`q}�} e��|���Y������%g}=�so����HoG�G��d�{���<�>��>��
t��hp���=�}mw%�wub��pw5��%xwyY/�8}p����uBas�.}{�t�� 
+i�p\��+A� 
���To&d����������1qt�3���W�Z�::���	�����`ha�����/����XX�g-��K/��\h���ih�����`���jc�o�b� p00h�{���6q4�OR�J����``oI��^�Cjh������~?��9��!WB
&b0�0s�7t��8�098::9�3�Oz���g�?���143q`�p�1�s����)R�����������������8��Z�L��G=�����X�����k�'���&Sk[CG����W�U����B�����Z��m��M�����?���&�{������2�?��M&���`aj���h��������Lh@�Y�$�?$����dEU�0H��1���?��������3�8�����Q��	���������7>����w&�&�����o�`~88�@��9�����������������������
�����#������$$�_��������GA����c������������s�ar�c�5t�B�����������M�9���8�S��{����U�G|��f%%�*�O�������c� �"������������+666^^^iiimmm###kkk������������������������������������������������G���'���``���o�����Z�s���hc�Y��y��R���_�U0�R���
�����"P�Pp����y����^�:����2c�8.t��'./���\&��E�}-�]�S��x�����B�9@{���x'�wh��:����7��=�r��Fz5
;���; ���~Z�\N���y
��`�����R��O b�eF���pJ^���+�g���GxG����$��*i/>!����\�9I>k<"(9�I[�O���,N�2���f:��U�J�����]�s2���Fq�W�6O�R^+�L6
5A5��TX��5������4~�����'����e���]�-~�X�b+�C�`�k��������ve�s�0�]M��E�RQ�f@��jy�����Na���v��3Mr��r������Ev�V',�M?�Q��OxoB���3�Z�-�q�;��&X���8~m�!�U�����m��]�^o�4�9��m���l�&��\'[�%fe��k��E��	K�%mJ*5,�?��m���
���Bz���agf8�[�7sF�4*�7�YXS2U�lp�h���������Z���>��
b��>W��.L��3��DO0����P.��e�w��RA1�.H�'�r��'�;�C0�Z?g'&��N<�������"5�z)SKk06�_�l0��>r)�6��K����w�+����SgI��jb�G?>BS��p=�<�����z_gM�iV����hFHw��,
�bi2M�^=p1'f(��>,c$]���u��C$t.����$8���A8:�sZ������|��y�f��
�Za���(Q_�����c�J�@1��������+�z#^��J�-�&l�;W������a.�����jsg���a������la���.b,��c����~>������[�v��YuX�����b��d��h�,0���@�K��4��6M�;�t
v:L�	�r7� ���_�xo����<^��a5!^A��&�2��vuY�4��z�/jAf79����B� ��I��]|9l��P�f"�q����K���o�L�����R�����|i�& ���:>�C&�{Ybm�3E���Xrd;���4�a�A
�M.�g��|E������\���A6p�Cjo��
�Bv1�O�^��D
��:����5oJ�
��d�C@�W�)xI���n	#��� `@������c9������h����S�~���C%���� ��v��dw��v�����~��+�/�����`�����MIm
�ic3���Q������wR~��I���[��uH������j�F��o�������7#"
��b���s�	MgXPc:5�d�Z�`�H�p�6������1;^����k������ %����z�j'��d)�%�]C�����^�Q��!�=,uF�xiD:u�h�6x&�8�����z���N��{��������q�������%�����!���tXk,�C����F�u�NW�����8�n�MM�������_�&Z�u�Wc�}X{:��U�����0�u���,���2�h�j��&qU����$�_�J�:�k�����p����B���9��j��}��	���Wc��B����AlD���	��n]io ����t[��8��4eox{}Gc�����*�2lB{o����e�h�����ub�H�f�6���b����<7|��",���!�#]N{E�}��^0�������r�kx����oa�R4'd�rA 9I�V>�M���p�A���5����4������A��ln�a�H���
�����$������m��5�U����=%�nm�h}HNe�������AVr����$^��Av�!;�X���)����Qu��:��u��1n���,�~~nbP��!������y�f���8v��X���z�=*�����?��k���NM�W�����N%�0��p��M�9�jX���`Z���#�=�u���8����L���vr�K���u���Q����BR�)Zc�����#��w2�6�G���,1&��L4R�g_�jz��*w�
?X��T0�d+f�o�
��eh@[G�4t!�:4�����s�%��M�����o�O���g��9��^R:x�Ve����������~K�r)gLi�)�:�a�T`:�^�g���0��UJd�CO#��+�x������~�n6	��>�ux�Ef�A�.�"\Xz�����`�������m]�����e���rd���1��p�h+��>�����Ve?��d��j�N�"�S*fj��@+�K�)�~����F��i)��9�����������^����m9:n'���Q2��l���s�Uv:��%�-�t�@�@0N\���q���x������*"��w���%��MC�V�u\"�r��6>����m�4��L��Aoq���
oj�'5����Yv����.�6m�J������![,�v��,��4�����NK��:�_V������L������dZY��~,!�n���msf������������Fm-����r�����}U-�c���%�u����o�u����&1(��ap�:�G���O��:4n��GN�����/���/#�q��B=e�S��`��~��4@�2�
�}����z�o,���,P�O+d4�Ui��
|��������
��vQ��yu�BH�}����N`�������Z���I��bR��G�&�Sv0�&5d�N�Y%�
G]���#JdHA1�82�8P�����>����6n���q�B��a���B��A9��sVA��i������=���Q����M{�T��m+�Gye��f���+S�q�&^?7��h�j��[x���O����C��A�:�~����x��"/�m�����
� �y�����T����e��?4<bo�L�}|;���^/Z�s�t����X�f:
T)�S*r��L.���.����E������;����@O;R!�i�T��_)��t���k�4���������x%'������?��Z�FY,��U#g7������~g�c�-����h4=sZ�\xsS����q�Z��:�F��n^����J97��r?����8�y���nG�'�Er����^��cVm���J���W5bvv;���������X��h��G��/0i�'O�zX^Q=?�������:��Y�P2a�k�_S��4��Q������W&!��?^�F�ek�R��}�b�i!*|��j���c�lZg���Gw���������h���W��4�I��a`.}���4������ec��`	9�in�j
�G_���W/��`���Me�O�s�\!��i{�^y��7��2e�\�C
���^����`�
r����26��|�*s��6�*���u�N�<��jeh�������t\�@MA�[y��'
a+.-��>�j��[5�S�7�����Q]
��:=��S�1��Qs�,]Wq���AWm\���E���������f~�l��|���E�O��r�ih���t2_u	�|�z06�1�%r~��m���|~��77�0dd`��>o�Zn�0Mp�RY�X��F��L>G<�I�db�������~�c4VM4m��������v/Q�"��+9V�Z���Q�S^���Z�������������;���Q�Q�D ��s�]��>n���j��h%��d��E�a�<Y6Jp�4�}Qf/e>ah;d��	��i�zU��'���7*��]��w[��]-���y�e��������N:.�X�����i����}	&t�U��'���zg�eI��x�#�g!zg_:���y��W��N���� W��${fH_F�br2I`��e��L��L����j��Q�4ie�4nU���������}�����CR�~������HY�O���k��I����Y�����������6���c��-�oW�2�R�^�6yz��@/���{��I,l.��?�8Rr�5���sF��X�����77M8��h�Kh%�S���N�J�����p����4�~W�R�4�4�q��@��*m�'x�GOBu��-���7�����"��B���[+=�}W9Nn���6����l���<�<e�X�O�.�r��g�s��Y���

66���iO��������Q�BT�5��}�k����J����������x�����MZAf���W��E������x��L�Xi�m�s������+�1e/=|��tn�e�9�q:��}p-���>�P�G��;E���;���r��4T�2\���=�Z�E��F�pS�Y2|����`�-�Lkm
��{z�!�&�d.)��HW��w������~����?:����&��V��tm�1*�Ev/�{�J���c������2%�N+q�\�����jZ�;1�Q�$4��U� u�I��[��� �z��c����+ko��Z�Z=�E+�!�!��J/`V����~�����h{�n���9�F�Xa9a�u�Q2�@O�l	l2����
�^'����;����kM��B��L���c�Ys�/������Z6��FA|��7�v�+C��Q�&��L�:��>���a��-W�r�-�%�2��U"�3��-&i���j������`�	�a0�v�O��#\�v��b�=iEm��;��B�zf!��&�����6�#���v@�J���Q���Y�A�s��4`;Y��aQu�wY�gj�i���J!�0��mN�Z}`��4�b�sD�=������f&U�~����4d���a���\���:}��{��_^���:����P�Z��!�G2?�w��-;Z�V�('�\�p���R�d���v�c�T�U����f��i�H�w:C�0�
��k9$�EFIQ��e���A������L�}��@K�a���]��:����G�����Z\�F���V_���W}(�{�u��
�������{�&;�����^S�0��y�-4W�������n��������w��3�o;T���n�fG�cV�0_�D���^�n���-�;�/�|���
?
i�U������M����N��(��^}(e<���b�_x�� �f���H
���.e���aw������_�x�H������V��k����Nq�����65��N��1c�2���{S�xN���W�x`�	����x�P�Fox��������=��V����<y�M">�D��}������6������qR�HP�a�����SQ	c	�(�X� b��wt+R���S��9��JN�p�)�9b�j?b�qo/:�f��A��n�[�0��~�M�I[�������+:��-������yd���6$m�b �a�����e�p�����G�l�������M�QU�I�?�_6�\�Y�,�%�q�Av�����{e�s�=�]��4�21�]�QtU��N����F��V�Ue��[u�����5%��P�|�b���Z�i����%�f���55)�rS�Cq�(�O���7������}�T��V���0QjgOa�Id���>$
�0_�
�$�&VH�g�G��T����W,�g��A��6�t���%	�1Y�^�0U.�����,������#��
��
�.��o`�I���z����[M]j��9�j���8j���7�)���dUU	�)$�p�����W������g�����D��I{��+�C��r��9�M��R$����Zn'����mo�i��?��B�b]���1u!��E5|�&�&[�Cf�
y,��f�L�!�u��b�������U&<�1d^{���v�������U�>W�2#�n��������DOE�<����L�,�VI�B�	��9M,�o��2Z�N��p��)��z����Sm�<G�����J$nDbTOy�(�*����OQ�����l�Lbk{�t|Y\ThMf�q��`P\,u��$K(�EW�9tk���?a �a��i�-~�s^��x�S�lY���������\�w2�����y5O�t����X���+�AwQ��k���d
YM��{�swv�5��}�=�v�:9d�
��JKU�(��"��F��6�e��<�c�//U��5Y��p;���=j��:��G�i��I�`�P���?���,8p�%���EE�X�C��U�$u�]�0�gx��F�y\eX���e��3�.ybj/���~�En������(�jc��3�bt�I�_e��n����=�	�T��O�z��%����U�:(����Q�IQ���%�R����}]��w�N����������.��g�������8�N�j�YD������������k/Q�w\��gB!=\;�.�����8jq{���Ayu��-�N���D���������T�\a5�%{V�u��B�s��� (t�I~���C�
�����������>j.]�D���KxV�<d��_�7O?.�����D�������3�O��3�,�R��p����3{#+���Eax3�1L����-&�����S���lw �V��MF���k.L�[�X�U�������b��P��/S�3U�[��wPs�� 8s��P��(��E����V��F
�*z�eS��9T��r:�^�$���Hq�_������j��P�,<��Q'�r
-����K����W�J�����NS�/���u�4�UV?�R��&N,;������X|Lf��<�����Ap��A`�r?����_E.6�:���.�=�\������������������V9;�4�����������'��
1�W�,��f���h��x�A.��x�{����< �l�M`���H}pn3�9����_��>G%���35�S�'�����P��o����bP[���9�:Zz��=�5��xj^��qG+������}�(b����o`Y5(TG4~�MBb�������f�G�����:$��)<����?�M�,��/zkRw>���?�K�=�����k���D�QH@`{u�| (E�Qk�>rE0�F���+r����EB|Z%�Z0�[����y����e�����W�/Y��k�}����V����Em��8��5�-a�G/�1E=?mW|�4&@�j�_��W]��VQt�n����8����r|�b�b%���t?������j*�$$h�����|��%�W�9���v]�<]\���������c������o����?
��*���c'��V�~�6���:���M��Sr+�=Oa]��$LXG�I�n��rK������;9EV Z.����S��i���%�;z��>.T�w����;���j+1D�ic����-��\#�vr6j�����r�^,�#�d"����KU
j�/�e/��3-j��dV-��6H5o�B��T+��\�~�:X\��E)���7�JkU�H@�ku-"�
F
��J�����w�o���n\�x?��e��+wZe�@�get��_7h��H-ISe�N7�W�$1�����rx��~������@LI�}��>�OjI�s�����P�2}���$01�=��^�M@	3y_1�@R� %�|_����nHt��.������Y���;%����������(�6��{�u����g�L{�e��*�`�;������~�+&��f�i��ox�����r�*��'yE��}����C����{����fj����I�K=��0�u�NO��r`/���0o��M��������AS��k�=m�+����GC��(�}��O���0��ZG���X�Q���}��-o��f���,��5 ��`�n�	s.*s]����Y��Qp�jn����*T�]��Ql	
�N�����j*���N�t��|�0Jipf$0�Y<Q~�����7�Q_��7�9��/D�w���Jx���=*�y�q
S?�{����$���J������2�|C� x���0Y��]�B5�w /.�[�� �O���-���O��������Xu0{�Y&�	�������'�6���z�0������z�%"���R��3'��z�������%��J�������G��oG(�

��ti��*�NhPL�^�(�S+��Tf���� fQ=R~�@e��[���������sO���F�d
1i�2h?�b������.���E�
����������wp�t��y��N��?������?���������N��?���<�'��$�:������������������<��4�������A@3J�_�����=u�eF{��i2�R[d��bN�n��z��rdM��!�����N6c�~	��:�f���y���::�d�����&-�~B'��pu�k��|Y�$�{	�o5��pD��x�Ckp����{_>�]���-�~�ziX��RG�''��@�$'��5�L��3�`���u< ����.rxf�B���Jk��N��g�����������Wq�,�����nO����O�k�5�o�����n?m������Z���,���t$�?B,U���fFY���7*,���|��V���bL�U����&���%����h�\e�i����wh��!�����a64mQ�7�3GA���$S���o?h�Bxp�rF]I��P+�TO}K���Bw�ag>.R������6{�b��<i���YB�'�6���BGg����4$+<'hh�
,y�+�s����TFb,��Vz�}u��;���RO����4�Y�����������q
.�HH���y��"��h�J�����j4X�8;�A�(\8V(<�m>�������_*�����e�/���'��<��|�e�l�����R��8�V]��0O�G���K������+�I40� �+��eV3z�;�Aq���9��������ruN�e�����C;����;�����������;��i����y�������b[}�������������oOw��W����O��!)W�z���[��G�m�������:�O;�^w&��.^��Q�>>tOJ���m�������s����>S>4����mn7� ��r����n���-����!�����R��d���<���:�K{���yl&����������;��Gw���a���a'b�7�"�v����w&���!�_�k���E�����O����������������]���z�����{��k���������S�������>7��z{����>W������.^'�'�O�O�Q����.6����z�-7$�g�@��������� y��S�����';>>��i��K-��M7Y��i��zm^g����K5{zo����o2��������d���>���\���?�
��?�_n
~\Ne�~�z��O<z:Z�~��'�9_����<�����vR>�t?�hwx0�3��'��������{b��)���t18���]�����J-������3��.9�=�d������;���q�aO������ruwk�2C����`^�>�I�����7�>�X�z��=�_~�9O����i�:1���pb�~=��#"��g�{��M���b��Zy�y��7�go�r�q;�����������iD�n\�!@�������!p{������b�����<D���}��������z��;����������uT�`|��������O��������b� �0y��=��_����&��<;Yg�hi����L9��x%}w����s�"�^�&�����dsHK�O��@�������~����}Zy�~;�g��U���B��>���+������'��d[�%�'��r�9:�����mA�����)���77���G��G�NWgLw|�4~,=>Tg<n%�\��x>��:^�[�{�O����O7�mO �����u�S��w��V�W�Z���u�J}�9�Cr�����wk#�2����3vse��������mt9	�N���E��L!'��6nU��|?�����/��������cO��#���]<�8���-�;��"�&�=��?���
8�w��fx�	<� ?�_l������!�]yw
z��{?�^6'�f�����z74
�=�{�z�]��z|���;�_>�x<^�����t|zW�-8���q�=?�����E����R���xs��h1�zLK�>.<�c�)��>��L����'�z6�L'aOg����:7;O7[gOu������>��\��8^�m��>�����{���om���3���7�������3�����x�<��Vw�3B9��,o�3�/����m�Z��m&����u�{J�����i����["_?��^:k���<<��b�<^�6_��o�_
��������Gn>yj:��T_��{z<����O����Y�B�!��<i]�Z_ul���r�z�8�~0�_*/L;zw����Ki:O��L�������q�q9���Xj=�?<x�w�.���z-����r�xl/%�y�����	.���;6X1������x�;����+���b�Z��+����Y$���B�ec��32�����k�wq�H|l�M���Gn�W���`�
�b--��Z����p�}�8=���>�\N��� ���?�l�bo�>��v�{��z+�SN�|L}��d>~���jD�_�
s^��A�v�U�9���,��L�����7Y/��Vo����J.%������69���3��1���R[o_�������A�L���{y�����������;�����w�9��IcY�V@������OS���#��E�7���|���Lm����O��{�>O�;!���L�>���@��X�!�hm��y�x7I��5s������
>���0-{������<�g������o�p�=:_����E
<M3�?:[M3ywmk�xg�,�>�������/��j3md���zo5t�:_'�!	:������������'n���q���-��� nX|wW�L(x��������p2�4�2�;�)Hu��%h~�D�x�uA����������z��)J��"����
��=��s����>]_����'��2�&�oQ���M��`l[������L9 �������������V���������s����+��;�������sl!�����Vo��}�'���N����}�����z��|+�M(�x�V����=���(�[C���������<'�nU�'7��!�����(�."w����_[�����{A� ����E��������fB�[(}O��>��x$��r��8q[���a��a�}���|�~J4����X����������sT����M�.��iP
���P8��j���7��<s��V���7���1j~U�2��F{����	�F���R�x�._fK���h��7�z��8r�8���8(��o��x���%�"ot�����%Z��`v%�]6��Ze���������`!�����6������S,2�����uHg4�&l@�$P�����������$����M��E�.9�|}�|��&�#��B9���������N�N���1#�n
�~����~�;<���*�xP=���~���rZ���.fF
���Q�,w�l�
��w�l��&:�3���K�8�Eg�����2�o�g�|R��*���M&���7m��
��
m���%���*M���(��n�v���v�3��<����~�������5��� 5���k�v��kt����u�������%v�k,�@�2T�-R������i�6x����6��T�-��G��@T3�W3�yi
��%nJS<�G����G�^�C��6,c�����SD��&��w�9�M�y�����{Q9"@u?!E�{oY6|am_�����@i����0(@��
�8��t����(��E�f��9���E�5I��+��
��o�� ���tW��=���D.�!���F��Q��+��|%����[1��6N7�G�nZ8H � ��Y�
���}@�������i���_(>�>���-�����p���?!��A��2i~6�3}Q�v|���Ed�n�`�^�G�!G|KL�G�X|YA �O����4����M�LAe�0������9����/�����M/�hm,��~�6��@�����;��Lq���%br{���S<���g��R��4P���t�\������h~�M\j��RliM�A���B�	����k'_D��Q�����n2/�	+�'%8]���
	bG@�/p!����Hy���dK+_����G���+2����?z�d�T=wPT�>(Jl����U��\��D?�iU*2�Wi�#G��tf�a;b;k�O�>������
�j=L����^��O�9�7[V�R���P�E��8�c�� �'D�bi
�a��<��6�  
v��(�����>uXBu�~
�a ��(]	rnm�� }�\����6��������@��8kJ[��.����D\n�n��aj��dk�\@���577�I�!EI���� �I,)�Q������X �$������j�T{� ?�H&0`���w/��';c%���<��������sDKT�B����q
�+���UF�������������\��c������2�M�,H:�(��*��Z�D3=3��f+�i*\�S-��-�6�n^�h-�-�����
��� �~���:��_b��GFo��7#���|�c����iN�x���Y��5�{���og4*�c�5y��L�^~���s��
�)j���bL�/�2������H����;K�&"��/��!!�n�������A�����$��!�G������=�\��
����B���\�p)����oQ,T�uwL����ng:�x��c�lMZB��Sk�k��-`7�l�

J����H�nI�
U�g���LL�b5,��������reu��O������f�I5��<�m��� B�W��
�M',&��������1*JQ<�������0i�a-T.G��q�i�:@�O-�-��a�~
su���k�H�!o�G5m~`����G���=L�[N���i$�z�+Tc���$��x��2�4������y��Ey��^
����������K)�_X�����V�2$��2y$.���/��N��g��M��z���9��Pw��gcG.Y�-{��E��3�?����B#;`q��ug#�����1�/��Xc��U}�^�����?&^n]��z�\������E����O
(G^K�(������Tj�U"������7���
E�@e�j:�B��_����R�f�8&��
`"=��nLN�hwzNGc������|���	�fr����e���0	�D�D���@o7�R�"-�i���
��:�'���#D��w�W]�jp,��>��
�M'���# �������*���[ l�|*�R��_��8j����������!�-��cD�y�����x��������>X�w"H�W�;��������s�f(k�Pw�_�z������m�3�=2���;;7?�j�?�b�*�,�D�^?�'��I���J�J;x�:�!u5��aY��mld��B��;v
�������P�!�tK����]��H�-��+g-K��y�=/���k������P���BC�����jAX�(�����N����B�n�l��k���*\�\�	x�b�1K��g�da�gb�'���n�_�~�]�~�l1���wv�H\4k2���>;�Q/����aH��
Wk�A�xY
6Z����n���V�I�U1�g�E7(A��6S�����K��I������[�]��S"2�R�IvF��5.�	{q�fU���"Z�U8�kqv���ME�������A��X��S��%�
}TT@�
���������*hp � dh1���������^eg��p, !�[�O�G16�����_\�������4~��q�'l��m�me�$��+<v����Db��ZD�U@�/��?�����W�
�'����.��b��2�P9-��]���`BR4:%.7=#�d-~�U@3vC��t�_.�#�:<�������Km��f������C+�c������`v2�Y-U����H?#��Q�gq��:t�L�T�L���R���<h��z�~F�<w,����cQje���x�b�������^����t��w��v�Z��� ��&Uv��SM�xP1H�r��u6� �7�L�J�H�o�o�Q;XwW�d1*���?�#j[����4�]�sCU����!�����J��s_���3�f�A�!���F�?-��>:6�Y�Q_�e�IH,� �T?`��g��<����I���tr@�t|�~D����x!��}_N(��:9�-Sy�{<J<�(D>�lLP76k���D�� 8Z��J��nn��qJ;Xp���	]�?KTD�����Wi��+����~�>��$��x`����b���T/R'�=+���1��r��{�wh�rsV�@�)]��n)���m~J��mR�=���5������<)2��L6���y��J]knJ���Eo���Q��4���j��K��L����R g�����,&.	��D�Cp��P�����~��;�NHr_X����S�J�.��JZ?�����,�������f<���^x��,��d�M��(.2U����xIq�x���}�3��3���I�����v4`m��ueu,�1V�;����K��K���xI��u�m?O��6���4xumE6hyK��T�`z�����r�6�#��t��f7a���0|���>��
�Yc�����Jxp���	/"o�>2D��P���]km�����aY1\$����?9��3�k�w����U�Q���f�!�����8j�OO�:��]H�f�Z:Y��I��[F�/t;�A��e�������C�����"N	X�p����Y�qPg���Z^@�l�e���{a���v�cuj�c�m��Zx&�pV7O���{�8(���BA�*�lSp��'���cwy)���"��s�������o%�^BnMi��I����M�9us�
4o��������+�qp�������L�d��A�33�{+;�X���y���9��Quj����s�P�P����n��y�y�N.�{������;���N�*s�4Ko�8�n|��)������j�5-������R�x���0�Gn|���]���<��Y�����ktCt��E���O���/�%v:L�!�O:�s�"%�rd�C�R*n��Y�����q|��0�+���?�L�3���Z��&I�v&�����uiV�n�f*�?@d�R�����u'9g�IWy"��g�Z0gE����&�zc#@��d����7�7r�������uC�y��������%#�I��W��;������cD��9�@�@�[��Zj���%m/��-��};��i6����=���!i!�����-�,�!��+���"f=6�/P[|�������(_�q��I��.�?9�A����� �sQ����lYeb~
��s�k�XoP&*����~]�8(��r��$9K��dxL����o�R�m���!����c���hA����@w�_"���q�,?�S�*(��� �gS)�~6!p������]?��|�W+�~x��.�r��qj�4Mv#O��;�.xc�Q�D�@_8��@P�o���E1��1�_W����
�~�\��aE@�Y���;�$wr9�c�VOO��-,�=��ER���Oe��(��w��P��g���gL�@#q[��7:VL���n=��=�a�������&gI0�7U�������h���z������N�k�
�Y���#CL�8��c�0 TPKh���k�w��H�V���~hL=a=��O�Oi���A��x]�z���|��s�_'��ATw�xp��=4x��rV�M&)>����9c�k�"��F[*:7�������
�an��g23{�����������g�;�},�dT���f�
���������o�6�U�6���x�KS�x��CR����e/�>�Y/�����f7�lf�Z��;�"+�:T��0%@s0'����Lf�����K����C�Vv���m�A�����B���cSA=�Kdy#��R4!z\d���B��`�Z�5�r��7�(�G���su����f��'m��(�\d8�?������V�Bnx���6w.Hz�kS��������Aw-��?�emmM�t)�	P�#��1BN�p�����~�T��G���L��s���k�'�������n��0�����e���?e8,$q(��l�}>t 88��q@��9�'p���-��'<j/���vc�F�%D��n�T��������AD��������d�����ikqB�/��q��X�^$��W�g#����$��j�Y9{����X���7uW��R��?C��W���;&'w0���~�	H>kIe_~M��Oe�+�L��'�����[����H?��sS����bwV����������8I;j���\���F�gf��x������.��C��;D�
y^��[w"��T��.:��@�Pe^F)�Ye������
���T����~�
�o�S]b
����s���7�wi<!�P�IS�"i�c����2��/��C�+K���?��X�0Q���h�9����e�MQR|���*x���h�4Q����7�`~�K�z��G7%�����F��+�����-"����BW�����S�����s���?�O����-�-�����=eA`���X���,h�<a�����|��"�?��bic���=E������%��HO�"����\e3�L�r������F�2#���kV��yi��WgT��?!�'����?��!P�.�\�
��"(b���V-AF�I9\;���a�\_ZO���yk�2K��������O�7�q��3�'p� ��E��64f�2 �^�[
@��JR���R���U�������{��a&�������Y���a�\�
�J��5��)�vQ��7�	-�W5�zJ��s;%���V8����r?��v~��-�d��d��d�m�9(�s<|����C�Qv	�k���~�6�6�#l�\��(��<x��K��	�����ghU����:���
>��P��D�����Q�Z�]'i]���,�g`���4p��K��]R����wj�YH�#Z
���������DA������$�Br$��Q���
���u"��������*�����S�hbw����l�h���l1��WQeN�����Y*Z��|�S���]��U=���4����Q~X�,"�+�m��~�����]�D_V
M��;Ff�X:����w�� �����u����������.WF �!�����'��si�g���M�du'b�TvJS���e��
�3���6����9�R>;=������������o����q9��#��j����V3�:�O_�]>
����xy+�y+��M�e�p�z��|���k��
�?��;����<���=��8D�r��u����
o����b�����i���S9���������"�3z,���k�r��`���;R�d�D�5�*��b�}0-���R|I����ax7=�
X
���mU�S��=x����2������J����+�}�?a��HTN�Oq�C�,+D��7�H�u;9"r�]�V��i��Yd."/�.2�"�$�o�P��{�Z����@+<.h=+�-G�V��V�=Rsf��S��P�;Q�K(���`4��h!�c����7�����]��T5�B���q�sq�������k`���]���jM^e��?t^���}���E�8u.�kq��m�������U>�����p|wF�Vg�0|^������(��G���am6��e��bo��4���ANN�xZ�P��3�)A�.
��I1�o�[��u���e�P��-lu��;JM;�n�?^����m�&]_�
�n�xv�0)`���(N7X������c|���
�'�;�U�Y ��HZ�o��|i�AqX��{K��E���h&ww:���P}���o�����tKa�{�N�{�����K���nR��6B�t�������v�sy���������.���2��1���!��n"�� ::%V�����K���U����F��FpC��'d7
�����KS!���|���|V��Z�G�,u�(x��xwi�xS��K���P}�:�{��c�?G���r5����$c�������ic��D����/���j���������:�">���Z�s;�����o]�x�l=R��j�4�������>�� �W��U�\
��1�EF�����:�^<NJ�<�_�>v�{������7+
	��w����?�����w�������mK���]��}����KKh���5�������Q���)�)aQ��h�&�L�r�"�nE�7��v��?��m3���[D�����QA�|S;��3���cC��w5��(�E=��I�S_Sp�JoD�	P�r�S���@�����R/��Z�������RW���_���Kf>*�������M�g	�\]�D���[T ���1���xTC�O`}?��*���~r���������=\�����P��������ox��@�����r��Hf�!J?i�S����a6�A^�m�����.��������6i�C����~���_�r�KAM���A~�?0b���cX���f�Fb�������3��?�L��5p�_>H�\Nyk�~cl�;����
R���q���Lj>q�_�����e<���?w_*������`<�T!�.��\����/^������������\T`���:~1[-�u�>9�h����'@��6�v����=J���*�����k�����P�Z>�S,��
����������8P��<�u���\�vz��Z���O��K���f�L���\5hzsH�|}�B�kT@�V�aD8�������9�Z[�_#�LoP��a���:6��W��]�W��}�U�pm�R��k�g���V��6`���b�!���ty:�!�Q���������D�z�������)�G�5&�)��:�:�~������
*��>=MB1�YT�oe�.���72<��Mq��B7�e,�*���\���������=R�]i��`k����N�v�E�.!�����2@�|p����c��>X_p_
���@���B�Y`@�
����������y�����gg�Rb���U p@b7�k��5������
���?��{�R�W��i�i��C�T=�Myp��e1)�^�r�7����q	i=b�������~�s��A��e�J<F�Ul�U�<�>���4gpM���c�v���z�x�����8��l��	/�����g��9��+#���(����^\�aX ��b���7����%(�`�B	gV)|?���g����Fy�0��0is�uH��(��J#
K��L��3��^2��I�jW���@}��r*�$��}a��T�A��7P�$��b�j���,9e�����f�~@�Z_�OC	����}��uqSA*����#�.Z�H0Ii����z��
��D��l*�,�E��;[�$(<�}T�!r�w�G��9lJ�&p5��e���!
E�b��:N
T�d��7�����I�O�Eu:���8S5�4T�_{KR�y�[�P�Z�W�Mk���W�@��	-��D����-������^����%6�C��B�+y��\��+���=��v����X'!�|����M!��g�gF��
�:����e�e��<F�?����wy8k/�Ky�8���:&G��6�����>\)T�:H�������H����/��F8R�]����~A���Q����$e�@�[u-1E��xt����i�D��������A�&�S�8y��y0!0d� �	;c3�l�+I>��7�������Kxt�xq)k�	�'�?�7���B����4�`�?��m�����.(�Z��)��T��&wqO��T���Es"$&&�zb����P��������*��(b|��O������\�u9����a�/��z��Q4���'hpw�������=xp��Kpw�{�����Yg����u�a�������������o_{�������,�r
K"89�O~#�<7�b[��~�H�;��~l����d����F��c�Z���(�D5����"A����m+,�1Gg�<�c���|aeD�����F�\8���
����{2Q#��#��=��IC�����u�r�<��z^:2Wy�Q�U�Q%��oj��u��~���U�n�Q���#.j���r�q%�\��cO���%��(�>4��7�j"�'s��g:�1�;>��nZ3~��3|���������-�����~?����3�R�>�r������T�"�����h���LJ����k�*RH���� �u�l��`gSb3���/apyl���y�@�\n"�@~/����fR�}h%���#����F�Rv�NO�'-X����]������,o{<q���-��c�b���Oe���V�Q�
5j��>�d�$����p����#���vA�TG��n��!2t�~�N��x�z<�!������w��E��[�e�x����3�.6���i^9��C�#���1�*-�3��)`��^"�O�������j|���������zE$����/���}�m����q�C)����-T&�	��\�riy��F$�R�7L�(�������H��@��h��SXaY^��0�Y��+.%����qv�Te��v��9�Y��T��}n�U�+6�F�~�Q/S�fu4���p�����E��:&l����):���C�W�����@����p
������{|�-=��P
@U0Dg��"��~Q�M���W�
����@��X�$��{d�A��<����f�"��%��vl�������������W��-g.L���)Y%}�����4_E��|
����@�ut���e2�4��������JxR%��5sb�����)������������f��$�u�����S��F�g\������u9k���G����1�o��[WVf�R�����? ���Q��ew ���W���\^O.M�f:4�/d�1x2.0)����	���(�g�@%��q�����k�������~c����9���]��|���;��OCc�L�x��� ��Kb����(��;Hu����!���������Oi�+^z�:���Z�O�B+�Al=�a~<X�!��5mP�{�dQa�Lf�C�s%�qf@���{ho&�
&M�e��n$"��J��]�NIx�71o:W�j���6�[�?35�o�4D?����?�Ih�:���Q����}�}'44��!#<$q�nL��|�U���o�^����!%�1_����Wk=�b�����_�U�-8���2��-�V���.4x��������|+M�7]�5�t����rZ(�g��~z�l:*�I�@��5�B��~yO�}��|�o�pO�o���X���Q��(�M9*������F�b�>����Ms�}7n�������}��J���-KQr�,�_ND������u*���N���q3�������E�
t�eD�)��6C�Y����t�jqt6�����6�����r���VM�
���{oU�*��� S
���j4j�}+5��7-!n<�m����������/�HP�%�$������]b���nc�$����N��6cbH�+��PE���!�(�)�
�a}7�G"K"<,^0��\:��cY
���tH�R~��B$����j�?�.N&���� ������B�n���|Xd�����'�����:�J�D��v-��hs���w�������g���1�c���,�K�����Qz�������Ow����)^'��<�Ok�/+�t(B�T_�*�%��%G�E�#�D��<}R<ib�PC�F����u�%���`������[Ai��w�6O�~���gC�<g�I�~����5/�,�1��M��$<i��*���e������6�����xj��
�w.��f������}�����%F�� ����mb���o��l����0:,J��_�
M��k����l����X�%��O��>�:������A�E���b��nn	���*\`�;���N��v���U���g�����W���`wW"
����a�w1���U/+�v�,t�����56���fVI����,k�����m�M0�,zJUt�g���t�\�2)P�Oy��_�����?�����=G�~���=Z7���q��W�#�	�g�0����R��"<Oo�'�W�c�[-�9�� �Q&���;��nW�d��{3�B����G��1-_�8�;r�:j���P^��D�1�#����Z�J���Aw�M�D2:����!C\&���:�A[}�y9>6\��db�d�j{m�oa_0����H������8_������g��}��H�������>d����X&�����H�lN][YU��[�>�?��j��Q�qg��;4`�X~�j5�<�����k�V���)3;p����!�4?�z7�|��~}�����)���i�oi������"6<k�������/<'�]D�u|0)*ZT�5�:0�%=YN/�D����[ZR*��F;B��t�C����7`�s�9���S��LS�������;��l�9���� ��������<���N�����+�xc�>��j�|�^3���7��/Nk��o���5�I���a7����?GjI���z?����} :=���������@#����ku�g���<��E�'QM�'%�u.�]��Z���5vu�W���������������l���x�8,y��Yu��Q�����W_�M�]7��9�����b�{��^1/�#������#��9�]����p���~�����d;e���E9�9�����
�x��K�@,�X ME�����k�:�m<�GV����O]�������9V�<d,d���k�\��]�S��86��J@Se���-�u'?���-��v�C���H�����*bJ��D�i���[���er��������g�6�,�����Lo�<s��@����T?{�8T\���
�F��_5����|���q|&bD��U�{����X��@����h,^�0i���;fI�N���K�s3	q�]�,�Ya�E@���v��2�X��v��BR �Lvw�:��RJ��ocU��_y�I�����EPEK����?�����j�O� ���Ak��bs�]m�uM�SpS���W��Rox_�zI��o���7C��<��o���da����*6��� �V��?�����HV�����vn3o�!�L�
�C���"����f#���~���-�$-e���^�������W��V�*:7*@6��&	�pK���5���o�o���r�A�r`'���D�q�"3���[7F��Y��\W�]�q��>����zb/DK:����O����:�8��p�w�v������J�)K5;/��(�q]V��4$����w��U�9J}�@S}9���������8�������W�~�k��.� fI*�����q��o��8d������a�`Fm���������"����|�QGJ�/����J�)���������e����2
[�+�F��8�S����	��B�T�7PG��.,������d�*��'HS&�"��w&@��+zk���[�1���K5�Y^i"���$�)�H�!q�"s���)mA��%��'x�8%4>k����h^���=�{Y��J,���`C��v
��������6.	+U[�F��51G����S�\n���;�l�h9_����]gr��u�%$�3��|hz�Y�z��1_��|�9�.�����(�I��9Y]�1/h��v��}�<O�<���.|��_i9���5~�[!��Q��guR��p�j���L�������C��_�|=1����#�LQ�.G�~4�� �3�&�����;���b��;�_ � 	���.zR��Op$`��B�s: !�O������mWt���&y�[���[E_j<�l�p�����R�9�����/JN��/S��_�� �j!������=�clE�����|��O�
�D�m<u����OL&�6�h�7S�lh���"�����*��<��j��5U��6�Q�G1�l���]��V����N:���i�S���7��8����_
�{�l�y�2M"��x�JCO*�WWV�*g�v:�	��Eyz���p	&��~)��j^�OVK�b�|U��z�yF��{S��S�`�����s���k�lC36G�=u_�s�0�Bd��<��}{����?�.�;�T\�����2#�8�5��S���#I�6hS]�qn�^�����O���;(�e��_��&�<8�������M'��(2f����������;+��P� �����xRhL(h�br��8lD�=��Eg������[�m��d����?\Bp�,TR���~��q@���I����������W�a����������l]������0�,0�=�0�#�h`����&_��Ge��������J�� ������S��J�X�Er��.%��������P��-���v�G�c�W%�6�Y6.#���]9�F0r���"����]����2
����|�/��qh��E�C�@m?s&G-\��0���O���B[�����o�����
���#�z���$���8#��*�u����<�����\���88�A��[�nJ�������|�X�9���U��6���E$Z�U��<�
����N���,z�?����2�B�=:/����>��r�� ��#�H����Tc����������,��u�_�^r��,b���'�p��&3�*��)���X��9��������I�~<�e{%xwT����6�%r+1��]��k7���n�sh����;��������w�@����&tt����@hG���3<}��kE�=G��Z�?�-��4���?������v�!��+��Xk}��3x4����~*�C�\��U�GO25������v��������\O�rC��//���?�\�|�)�B��)����$<(��}�:X�X�o��dvc��t�>���'�i����::�/v��9�s��� |��eQ5���n��R�G���w:I�G�6�0������z���+=s�V~#����|;��36�_{U���W�<�D�P��I~[�CNk2�]��������FE��$��#��?�;��C��n���'V���9����Ib!7����D�C���R�n0=K����?�r���R�q�9!sCn�y�/�G���Muwy~M��	N ��s(D���:}�,i����iyl���[�������+dZ�����b��b&z�����.�!�����Q�����lCyQ?v/�����P���'�a�H�o��_�[H�?��0�O���5�2��Y�?��(���:t5Y�~��;�yT������Q{;�	����;w�i��o�Jz���]M�R?��l�7T�V�-��9�VT?|�a:�b�?�.������g^;4������XF��^�w�s�h_��"�K����H��W��w�\���x��?)������/7����=��q��^�&b�b�A�F��T��S�=>���Eb:��~w3���_1*�+q���(Q����1G�>�S�����j��Y�<K����X��k����X�z��iJ�?�ZyFI�,�����i<��>�R��=����[T��J����������M
8��2W���o=x���x�>�������5��5U*�X���XB���N�K'CG��Y>���C�v�QO,F�/
O}^`>�!���������~hP�_���hc�7����C
i��V*���6T��Y���N��ZzK
D�[zQ9���&�~�������(�������Y���F^��Zf� et�8����;VV����J���Ei1R#���7���=�|-���!�����<}w��Y�T����������|������e�c�c
���%$t9�,n���qz��M��"��'��t1�vL��B��I5S�nh�`&2olq�l���^���s���6�����4�)�TVF�8����o�i�&P���?	��a��`gya�6�����qE�h�+�^9����Q���������_�e���-��C����������}G~��O�4�����/��x������*B��:W�<Mq�l�i+	�;����C����g�!h_��
���QS�Q����N���8�\R~��+%�q~�;3x2fL���s���q%��ji[
�4���Cy�� ��q��k�
E�T�����9��������0$��/������F�{�E�
����){,�K��l��G�R�4�g=��Ti��h@b�?�D�[-q��aa����'N��A��e<.2����>lm�>ZVq"� ����	A���!2�zX���.���2�10����-	�:u�1�����s���C�����_���b�Z(�Nc�qAe	h#s���o1o%y�k�����s����vr�
�;J%���g��+��*4�w����)-���Y;I�A.�p��:V/�^+�G��p�z�V���I��^��#�KoWo����i�7=�������5��28'�U(�,��sH�����Ht����Y�h� �+�P"
�ep	9����|��|R��l���~���?d�3[���<�it�*h�0�?�)���T�_����Oa}w������W$�}q(�V��aa=9��"���P�=v�,�\�xCF4s����,?G����^�4P5����v3��FRB
�Rh��J��e������O<�'���R�j����#.��/����#���JxY���G)��,?����P��*q�I*��
�����z��4p���U����a�
��-+�@P�����03���g7fe�e$!���I�<�%rlk��N�s����0�����tYy�����$������[��P�u�@�������Q|�����z*}Cd�I����8;�=$2�����&��4�eCl��w(_���	�����k��,4a#�F`��2%z@�h��U2��
E�w!���Rn����8 ����2M��/�C*�����AX�v���3��y@��E	�/��4�R��@e����`�:X�_Y��`a>���Pd��U�����hP�M�Qk^s���xa��V�|/�@��v�k���5W��d���Z�����:�6"��at�M$x� ���X���9������x��e��a�O��$d]~v�����r��f��wr&�O�����3�D�����KP��~Y�8��t�sDI�I�U�Tb���+��:��<:@���;�e}Ft���^��/�P�5G���	2!�����O5�k�P��/��2�������q�dn�wF��3�<~�b�_{wC�n�
|>�_�=��������i]�*D:�������[��O�/d���Y�5������&��^��L�5�'�[q|,�c?��7�n����<|%���&�N�O[	�Uq��������VOJ�b��`<Y�7��/�O�g�M|����6�>G��=��<9�������2�������~l�<�\@u/+����(��;��c��-���#mQ�o�����l�tz��E���W�vM��!���~Q������x���m�Q�p�=�-=B��3�����A�e��.�Z��s���i�dM��d��������VT���PLS}��us0S1�^���]M�����6����B�i�Olv�yW�!=�j	�� �K�Y�a�$��opt���z�+�
L���\u	�gU���d�D����pp�0C�K����'��U�9���%E	�&E�����\��OA�S��oZ`Y}��H��	e�6���m����i��[�4x�w�V�������B����������PC��P�4�i���~�-d�jJ5�������+:��f]�JH���N�e�5���I��[�3:�t��9�`�������MJi��{P9����
�-�l��w�P��}:CSg-^�����	�A��nnL�_��c���sk�i����)������a4�RsV��S8�^���_2���v't}m=/!x7���3������������������4�q�+v�!j�xaVP3C"��ocF����}"�a}�}��173�����������G�d��~ps9/�������O�����qs\��>F�g�&���K����G�K�p>e
7���{;R����w���X�c���_���"���N��Ns\(+�;	"�Nynf,J�g���;bn`r)���Ab{�sz:T3�zT��]���l����qk�J�U��l�$c'����/�c	��+�b��meKo�L�de?l��b�����u���r��n[�MVF�?��	����a��mY�<l"���P,XP�=������Q��������O���J����PH��������������o�h�J<s�}dgd��[j~���3"p�8o��$��fu��l�u����(g��'�!���<���!���u�����F�w�=�;���n�#�S����?)�+�k����!P-�Q��7���g�'��>��70���&�-���������r\3�%�_W����i!����!�FUN����}n�����'��a��Vxs��)�|�4���"���2R������D�"��-ug���#�������I��O�L���iB[�L$�q�_��8���A������1��,���&��?�Mh����ig3$���%�����U�<l�R�C��	?r��b-�b��=�����h�m���]�B��������-��c�X�����R8%?�^��~py��/xP��q-u��X"�����8����m�_
i�c\;���bZB���i� &�	��&k��~t�*��o��!���]��4l��<6�������&�v�[�n��]H�i1����#��]�M�����s��*q�KaO���	K��E*����;O!x������|,��}��}�D�Q�z��L����
E��{8g&d.����5����
x�6���P�!��U9��A�U�)�w���vO�h�<��<��t��l_4��*x��bsz6T�N�_�8D{�������K��9��B���-��P{~�4{����)��������!
r�Z��g�7,Wh?��t�_2�F9�3������=~��<�_���C�8���H��n���y+�wuG�m�I��w�+9�L�mC6�7,��L���[��S�>�������j�w�2%R�r~�-��a��!����$��(�+
a�������pG`���]]+QD��Z�s������.z��D����Q������Q�#�5�Mk�D�)|��N�cO������|�k��.��8n6�,(7 �������ab�����Q�����	N��9��^ho�/��X��,q����7�%(�;�6�q���i���#=J��\��e���F��r��XL/�E���/�i/1�{nt�����	������^7�Y���V���}�vD��	��}�k�=��]��6:�yf�7�k�I���^������I�KAo�
�-��I+�N���z��^Ja*E��
x�CGOr�C�a�d�:�{%>��X�6l�	����[S��b�x�"���Mf��b��a!W���/kC����??����`�^�^�:�;��v@d4���u�{�e�d5~K,�q���y���P��Y����m	���Y2�'�gF���W$+b��5.�r�w�R��0����������b�J���r��(�����Q��"��6!}��>W��5z���Y%��Z��M;R�j����n|Hd��/= ���E�ZMU��+��p�
�a]�8a������X���dW��zy����E�NgO������*G��j���oA������ �����X��d?+��]X-����vw{����	���eM:K�hR���KD�����8
}�wF��O�~�M����{$�C�f�@6��?5n�8����������W����K]s�)XU�!���' �D�n���my:^H�9Q��h����.T����)*�V��WfC@���Q���c�E�{,��uyk�������t�B:k�.r�}%������'��H?C1
F��-���v�_P��#}�EK��n���Rs��i��r��z��fi�Y����� �Xa����%T�Y��]��#T����,��C�Y��@]���42��_��}xW��wrJ�fM�8�\��0�^�g,������8��+�&e���NY���:���l5a���)3���_����D�����3m<[�eGM�]�b��|`e�a3�a��)��w\�C��1�iH�4����`����oE4��n��xp�5��@|-��X��A��]jZ��B��P��i���A���y3����8�c�,_��g�H2�����9��h��
�(��O�;�q�A�����i.�����Z�~�/5�{���5������/o~:6���q?�)���p��(�����'&��P�Y����2�Wz�"uv�<e0�P9����r*�������#������G^K��y�~�V�������u��$����z��(������j���^vO�����8N��o9�u54��r!�b��WC~�	e���K2>7?�$�7�5�Ywp�B�w�s`�8��K�_-�U��:l�q��s�U/,������.�T�7}?f��1F���������Q1���p�8%�Ii0hk��*mL6n,��w�;����a�������)�a���ga����+tk����A ,bw9�����;`y�=9�|��0^�@����y�����x��I�78���I;E����X�S�|�s���eg9���%%:#������q��Fn�ik�@�]��>o-=PK�1������-}�\VW���o���e�
���j����bmq|��g	��w)X����3�-EM�_k5�&���5�
�"��2E�P�_�P$fS��UY���D�5_ZV����������Y�������Bu��*v���8�����_���mY�����"g������&l����<����1��9|y���v��J3- �_��Z�4����K'���QHqq����PY�3!�n>255���#)�	���&��J.����T2�~J��H�O�bf�����bsq��L�q4kP��!,%���#[Q)���E�������y�`�A��O}Z�sx�>���
D�A�
z��}C����m�Q�H\U:N�M��:h��X���B�X�����?+iw��KP���4�e��[����x�7F�xx8��S��T*�]�T�#�7��yP�^~�	X�����������vCD��Jg��r������	2R����^��}d�)������t5NJ��:��$	��S
G.*�2�t_��{
���j��!?�����������"�
�����k����y�����V���dp���d]�ZDU���\� 2�U��L����5����)��7xZ\#5)��N�	��[�C�z���)����}A���;u��Ox�*���� ]tHmO��)(�r�'~���b`F~�6-�mfY-�?tXD��=!�;����V\�f�P�uf������3^sF?�X�p@O�G���4�����J����fb#y�M�0���#��!&��'�� ���%I�uL����p�O���F7;E��������]��`3e�`\�[�T�o�����v����2P���X��,<A����
��}Q��B�������
g��z����A���m�~����%���d3��}cy���^��C.��W���CL�rY�O���^�H������a�.��e<����8@�#>�,��h��Rb�P��o�����X����^�7�]m=�?
���=���Q?�(n�qp� �#7�
��_*���x^�����@����\��]a��Hcxb�X'X��`�d3���s�n�7��US�5B3��A�b�����V�8Y�0�[p��
���D��|��u�V�]�a��&�N?m%3���/����8HO�I�C������I=�v����doQ}
�����c	��C�QI��/VzR����~�����6��SP�����,��jI6�
�U�~1�)�f���2WZ�������������Q��
Mw����e8B�����<�Td��6�*?o���X�
!�.����y�����4��I��7_�	�o��i�gC)sz��l)`���1��E6����y�0�O4�����BR�\�F��_<Fj�v����P���ckv*}���K�Y]4���.�<��������s!����&{�	X�?�!5����g�F���������� UB�^S���"2������6�-���:@��ie���HV����W�������m6�U	�����b���5��]�9p����`�a_%�{s��}(E�\�X�A����hR���U��F-�F�2��Y&>��w�T(��g0^(�Q���bE�����v�0�������gq��!�v��w��9k��$�����9GE��+��Yc�X��|��^�"��N:����,.I�m��:x!��e��d��$��q5dLO�������n�2A�oP�����h?��`W&!�����������@������TL�����`�*K��,v��k�Y���w�e����*�'8�����g����>?"{����/[V0#8�[t,c�>xO�Wy8�tf:]�lU�kU��i?g�L����{��\<�af���M��N����3y�7x�aa�����'�5~�f��jP��[R����0���0I�R��7r�~F����,)
�&�����~���3M�V5��h9b�1��f���NZ"�nFmb��b��6ot�u�z�;���Q2bmb
�9	P����m��u�[���+4|��=���Q�	1� �ZVH�ru�|�������7��h�tYF�^C�*��E�E�je�z?���"���oF��R�1�|�y$�{G��#Z�C��w��a�[���v��_�Q��GI�Z�ir�f��?�d�f���!�q���6������W��&[T������8�#�)�����~�L���H�bl�Xoh�Z������ij��m�b3�NK��dv�9/�����tH,�a��;�����W��,(�m���\�Ftc�m��b�em��n%-�)�8���V�3��>)ez%/�TQ|}���6�@p^H��#���#l`MdARo�x9�_\�l+3*������
�������Y��TB�_����z>�+�Co���4k��������0�*Y5%FA�����sw!���^q/��TB�l����� 2�dGT&*)���$[T�"a0�Y�Z��_X����c��9��n��uI.Er��B1�n����`�+�<�k�p�=f�s�n�-��O��?����n�=��4,�)���92z���7����-
�����-�����+����:�U��������ky�.6���\�MX�3:D��u#���|j>/�����!:m��*���*���c�.���e���� ����������?B��	��}���M'��)��/MW������/���Tq�q�|����I��k��`�z�O��(%}<=9&��=oF��gzM���g�O[������!P6��#L��ck���!��xSM��r)��A�G�	r��F���^�����:cm�����}MF��W��>7���Z��H��G����us����e~b����.|S6����W(�����3��r.��G��:�Vt��tR$�����#����n���-~����]�	sZ �Q�&���P��
���~9������TB�����h'���j����?�S<�C�$�fB����������)����������\�F���������d�mJ�~R���KP&<��6M9�(�_�������L�����
�c�������y�d�?�����D��^�U�D4��nx��J��a���G�IK����H=*a���U��
����`a���8�c~��P5����%�d����r������S�&���15���XX<P�gpVK�	�~�q���=����c-��
?��� 6��
���H���{�ZR���H�n�^��;�����O��Ka�_��m������Rh��d�p��:U�Y�!���ecWK}#��������d�����y���$�-sdq�/���)��oO���_�0���t#�,[��F��^'��-x���+#�O������t3�����I�'��2kK��D@��N7���Sb���3����lt=	�n��&h<�)�
W�R6=&��|O���v
JG�#� N�_�����P��������6��^��
��U+�k{&�c���H�d@Q�\z�Ghj�,O^&"�#��*,�������n���^y����7>!�����a���C��@QHLc���]�	�+��
�,� f���?%���P�&/w�
#�s@���q,1������K���QPzn��_�,��~�E\�h���J�_�;���!~�-,��j�
e�S��������M����5'�o(z_�"�
�����u�_Sd!�(c�r�x������M��3�i�oQx�
0��O�����WA��VE��@���T*��[gM����M{{�/4��<&&&(�������}�����UX��fuS���~Q+�U3<��Tk�J�K�Z�-�a��R���O_�y�JQ�DiE�r)�2�&�|2f��6(| 1,��K��FK?e
1��'1���&��v�����*|"��$�EAn�n��}��m���/���������Y-,4�7L-�M��E�?@�qT�I����UD�/����O7�������)�?vL�����!:�r>�Y��
4{d{{�VN��m����$����
���GL
�[�����q����-�CX��%����/�?(��yvk�k{�Y��%JD��t2e������9R����L.E�TL��
D�64�a78�y�/s��%�g�&����JzH��X�E �>M2��
��_�����@�#bl��dh�Q��
,�
�M�U���Fs���n����I4���J������
v#����Se�������r���Tj����$�~�u�/{��G��r`4�}!���^��$����ZRh>%�d�w!?�����
�S��l�
:������|��>��1!vx��q�MN���d�ro���	;wv����s�~v��L�
��(���	%~
��MS�w)OC��/��mJ�#h"����6�}s-��G����h��}17eb��u@�:�������&��+4���99��k�2����V��?�:���4�4�1i�3�}'f�y{
����t��P��-���>r��]���d��5�+������uTgE1m��WKx�$� ;�e<��P$���BYJ��7��l+G)���$]�P.�^7P�Z�a��<����k�6��?���|�.*k=�k��i��k���:&�*�R��y�`�i'��������&�1���>�������f���z[���bp�[}��o_��6�x����2��b����t���t���v-��V������]��_���BP�A4�����siH&�~��4*���J��Q��nD�%�4,�z��d���&��.��F������"E4iL�n�x!�������=U-}���U�5�~�)�8^���a��T�R����j��9J�����g�&2u8�[-��O���l���<�
� �in��=������_���i��Z�~�����rg�*����`��x�p��������u��r������������
�+3u�x
N�@���2��#����|�v��
j��:������R���:�=7�	,PL>�~*�]��n������4��h�T�qS���:�5��)y�l�Q������m�k������EX��3������f�B��{����"P7^�������Zr�7m�9������$5@�}z�4�V. ���l�v�{!4��_�3�8��>7-��v�Va���~F�V��^.����#��2�^|8!��F,;�a^a��QvM��P��H0�=��F�<�5S^�i����w1�f��ZP6BJR�R��	Y[����AL4P��ge�@A@`�
���K���X[w�����Wp���B!�N<op=6���(E��$�g����F+�����2������}��zC��=��1R��F(�X��l���E�����9�|�	C���(��
,��
8�O�F0u}z�i�����Cd��u����wT�_���E��~���/np
M�o��H�V�����fK��B��T�Bl��_%K�M"����J��R�=����qs��c�J�D��Q}u���������
�:HO�T��?�M��C�F��ch���	2Y9_]�V�I\w�bC?e�����kf���!��q�M
<����L/������b�����o�3���d�
�2���]a)��o���
�~ZZh���q���qiJ0�uH*���%"���n�����Ae-������f:���]�����`�����KOT��lDm�L��I����N:��$���b�<"�)�W
��V�b�g��od<�����w��p�I�A$t��Fs�I���%�?���
%��x���:���B�m�����!bz�vF����	UJ�����Y��������/���"r�u�`��d�6t���6�$
A�5�(��������P#���������Q�������q���_��{`T�~
�u�&vA�[W������NXK�or�1��(?z;3Mvi�m��]��b���p	���t���a�����B��-E�'Uh���aA���>���[�S��
���������*�KS+^�����0�(��'E�uM3�v
/&;G>�;�8F/�;������p��r�v�4c�������2?9��RU�pn�����z��p�4�����t�j?CX���Z'����W���,��[�09�G�
Z��i�|�U����yNt�o��0r����/�
M�&�A(���-n���;�NC�������YY')�WzT���1�
6����z��l/����<����J�U��������L�8��S'OJH���l]���D����7���f)5lg����Y�U��o�����B��.f4�8A���g��M��k�!����\>	�t
�;�R�����y�"����\���Nn��{��fz^�����z<���E�i^�gg�9|9�.����>ad��������
7��8��h��T=���g�r��1����	�����y���G���e3��K})6��`��1������� �UY��,\���M����>���2��M�7������+.�.n�������1	p����Y�t��S-�'Q��%|l��#�8�}�l������K��;�]��`��6��wu`���QW�"a�P�
_�?VK��I�Iz��	��$���(�(�d"
���yb�h�?�WM�m&G\��������O��%^���&��*`�]5�2�0o�&-�m�o�����%/��t���'�'����5��TyR��������,0
��5�&��&���������:{Y=]x|�Jj9081���|���=<���N@����Q8-��(��uo����2����m�%��|y�(�uGc'���
�@�}CmT�y���gR��������:*h�C1S��N�����\E�'S�'_�%��r�<5��sb2f���8k��yds��V%�&�J#�l�9<w_�J�T���G�����
������1�o�D�,g��>x��kJ��3�n:>_�F��\��J�s O,�0"������=_���7��Ie���ag������10qC���o��������y{�+|"_�� �z�I�MVW��kL�6r���|�a2
�A5��W~����.�	G����<������V��9+�%^F�Y�I��2r]	��pzG�b����l�s��5��Wr�o6U����C�(���	C������
����}�Y&1&���w����!�OTRx��+`�6��pH,��)+�d{�E�����l�zd}O���;N�_u��%#K��e��M��zh�]s��}d�9�pI�����g��2�:����Z�`�����6��+.��*�����W�!%d&�b�D4����xgHX�;�vi�6�*:T��^�F`�9+���uV
 ]@��F]C���{��Ek=.:b�l��L$�C�k����9����I[����V�y�!;�:_E�k�����������HK�u0��zo��O�q��Y���6��4q��L*���j�5�0����4<����U�!E]��!:�E)��W�}���t�mX����?�4/��t~��H8G,�r�����?�H6��ok���V��E,�9��������S�J#u�H>����'WGZ_m'�!w���8J�� ]����|z�n�r���y>��c��]���R����I�;~��^�����J���t`�O�~F�MJ�B�d����$>h<?��Y
{;���c�L�;I��t��?����ZE��������-�����a1�����u�{X={�}u6}{_���� ��?g�}E9�b_�BT4�N���M��P0�I�� x����G��NK�GW=��?��ag�S��!T9���xf��	v������o�t���~�1�/�c�VB�����)���0�	%�}�^���}Lh9�N^�&�E��o����H9��$��|��K�X6�o�~O���x���\���o����flPL�-����x����i��%�����W���9"������T��(��T=h�tKs���
��8�E*�(n0p4����$�/����-u�K�v��Dx�������
�<`�w��qY��B�(U��sxH� �C��Q3���fV���|��es_� ��k�����\o�ja����)U&��f�ZyU�c:1�&��=�s4M�u\�|/�H��zUw&G�k|G����Etg�j��BN#�?��R(�0���/.��D��qQ��G,w��o�����]���4��5��[�e����P�hTb����]�0R����*�G�G[�m>~�<:W��zT���F=iQ���P������hw����@K�J<h����Vo>=YZ�x-L
�s#�<���q�<q����mR�1��6��^n-�mJ,�ZyZ*`8�Y���k��i����<+���x��
�x(^'�,���s��m*���\<�����YY��\�l�OD�H4)���=m����\��2f�8��4�d1����}Kh���*;i**LO��?��<w/�'�������L%�8�]4<K�.G
1d-\���U��L�qc,�L�Yu��h>��M��%�f�F\�p6����%�|MdW�8&�����F��(JA����.���!&O����8B���F#:�,ax��3V��|�&NK[O(�O�����CP��^���^1��@�j��Pc���eO&<aIp�8{>�f�8)�4����
�����������d1$���Mz� k���m�G��#KU��J�K�ui&*�!�`�����i�u��!#�'l\I��S��B�5^"ydN�,��.�65�D<�Z�����u��3]M��s�E���6������KB�m��j��K�!q��dS
b���S��.��d-?_�P��� ��h��F6�%d��q���j
�v�|����a�r3��fsO��G�y`��t����NB��o�f��m<�p�|�L&nMT��,MY�R��r�d�� �1O!��3�w�vV)V�����U}��Y#�5��^1Gj��.M;���R�c%�B���Ic�J�^��Ti
�4�rK�)`��3�������U]h�V����2�zX����|dP=DQ+��4p�����zvW"����(9-)�9�� ��zq����~��-bm�aZK���������o��{�5����^>�n��4�.�C�_"�i�H���k�'���p��>/��R��@P��}�E@��i�����K���� �@#6�w4�i/�D�p�l�tt~��\b0F�P�����^������=���2O?m�vl8�~�-��=��X�<��oD��?0���oNH��������ZOP3�
�C�	e�+�k������?:Nr�����P#�uh��6T��u�w�W�SI�T��0��B��Q���"�/�x�Q4��!h��n�v��c*-�8���K[���G�\�O���5�19�#�1���hJM`��lT�96��P����t����Q�R�Q~��x��Jp���e�������'��Y7�k�Xk�^Q���,��[h����,��s+ah��+�������vYrn���p�f��k�����������&xq�Z��}��%�O�%��VaK����1Z�8a�2��tZ��]����L@U>y�c��g���1u�Z���w	�M���7SfM�n7��F��l�C����T���}:D�>?��e��8"��N ch�?�g���X�a\4T�'��=O0�)��j?4n�� �:O��l48�ez�E>n�(�7�^���7��n
a��
{k�9��5��q���`�5`@V�������y��Y�����'X�NF6B�k�9�v���e�_e���`���m�@}����Y&�p�v������m������iSd1����Z�m(�0
���ii�/I���m��z���n�>�� N�w*��������L%,d	-�-���d�	�n�*aA��_�@	����ES
�i�!e���VB��Ge�����
5��/!���~EU��,��evjD��jBDV��Y�i�E�s1ZM�]��w�H�)�6-��j�8�n*JEs��MI��}A��6�����~)�r���N�y�=k%���:���z�~�S	{CK����)\]+b��w�[^���;=s3X��L��%�KpM��o�J��r�fW��b�k[��k`�����Y���ow���X��fj��d?�{0fW�Jm�^�~u/���U+�#����`��qv�����`=�=^���S�q�t�7)����J�QwYs$�g���3,j�G�M��;n.�����1K����>
���R�c���E��q�f�US�^��(����U���m�^��l�	\���C�f�^Y��r�h���P�����v��l���e�~Y`��!RN�����VB�$���RU6�{n������������L��aum% ���*�0���e-���f�J�b�����I�{��/
����S)�f���O��j�Nl���M
�g���VL�^��!,����������eQ\5M���s6��U�z��>t�_����=%�����5���vF���}�m�E��-n��\P��0_Z���������K�/����F�Of!����E��+����A�.0J-�������l1�����e�M��}c��D� �^c�����`f���M��*�1o��pJ�(�YS�S�<��A���Sk��Z�W���8��zp���8�W���{��1���[z(���		o#����BA	���D_@�_���f��u�����X
���X�������3����'������z3�h{��]�����DsZM���W�']e%�$3���e��Wz��w���bkx*
'/'�Qf�3.�1��t�5l����R������������2�g�cr/��`I���<��%�Y*d[6I���(@�<����B����}A��@��(����	1��k)x���O?����e�r����%����� J4��D���Kj��$�L����r��D���gcFE���
"++��]��ps������]���)_�R�����R�<�b��R����o�Eh���H�����J�7Q#�����62��Q
�)��H]9?�*L�������eA��8��]E�_n������o
Ogc*(�������v����!��|���	pj� '�7�cK.e���o�a���J�6�Gj4�����Tl�|��'	�N`r|�{:e*o�J�����V?Z����&~IPu��-��~��<^6���C"����
�����rb���S��%+ syr���~\���lm����1�"����������
V�j?�X���KU�Q��R���pj-sN�A�+Ld�f�H����j�J��G{
�=���@P���QC,B������{���#�pg*G���Tb��k[�n����'WkQ�@3�"����>jy���A_@�~���.�|�w�"m;����Y��*���@0���
N��J����MH��z������3]��D����XZ"K��RKE��M����(e����T�]��&V���\�o�0�v����nZ�U�e���vV�������|�%�t�jC����=)�1�C�f=\����M^2KpD���_g����1���G�N}Bn�6����5��H�=��b���J�S#�I�`��K�XW��D������0�)��3�OK�8!�|�)�uzL��q���3f���m�C�b��>�?yxF��pM��+����� �.�U��W4�>G�t��btKn${������9���;!f���+���X[9�4��iDgB��;��$����|�?U�p0
�w�P��F���[Z��m��A?o�@?s����J�[~����!t�PCL�e������F9�+F[S���s��f�o�"�Y��}�����
�E9WJ�F��6��;�M�(!�E,�M��iv=�������XnM��R�v?���a�y�����/I�/Z8!}0&���-�(�k�yH>N�����������M]`���mx��wIZw��H�T��)��������Q�Pi�A��U����'������@��z6+��<�������U��a*
��d:�$���J��]T���1T�����O����C��kcp�����y4M�,hx��{�{����aDMC`���9�����Q��p5����:Wr�^rO�l�L�K�����>�V�~���Cy,����Dx]\����}{��{	6����0���0�\z��W�:[{2T~d�2�K��o~�o��7\M/��8����I�����*�:��P���� �'|�4�p'f����A�����u�d=Wl>��8�����/�l?A���sD�w��/&�x�G=-9�m;���EMo��~�����������1W�������Z�q��;�<��9�2��s��!��/	���pxf"U�'"`j>�m���(�L��;����#A�9f��c�����2��Dv]�����2�>�����~E&&@��[y~��5��k�������y��@V��V}�2��b�"��������00�����O���%�_7B"���^b�]���&��9z2���D��{�(�������C���c�n\��v��F�,�}�>T����K��I67��G�0��o���9C��u��(���F���,��m�����%c�����J5Y��Y����X|)�����&�$~���c�di����3���--�gq��C�?�=������'�Zr�3H��y����5�l.�c��F���?d������,q�Usw<���������Ic+c���kS�HM7S���G8�������R�f�b\vU�u���d����$���M
N�v��|4q���i(�5h�W*e�n��Zp�,W9P�G�Nnp�'{G�g?�+I]�a0qS���F���A�\��Z�`B�&>��Y��<��by�������nGT������3�za�Z:���]�XPQ����X�|�E�t� "���p[������7N��;"OV;��*�nx�t��T���
�6��]��%�\wJ5�nRh_���^jw���1�����{�^$�W�w��v�����Z�Y�[iP����<q��"�'�N��S��`�p^�Fn��76y5���'K�r�0�z��v���d�V����tZ�%Ni�dG������13;W�#��f������8�s������XM��q��h�0�tj�
�������<0�2�gX7#%|��:U.�<}f����"(A@�t���=�g��M*��A`�%��6GD�M@�}y���U<�S���g(���44WG2�wn|;�Gw@&�	��� 6/:������E��	5�K��r�;���FTg�M�1��T_b��(bz���^��l���3�si
�X6DFQ��r^e2�K�P]Y���Y��M&�����E�K�2�N���"�*lJr����jI�c���m	��_~QN}��&�M�`V��+���#��U�R��g�x����!�����x=�����kL�WD�6I	��k/[�2�o���hK��E��q�Ns��9��DQ��5��b|i��aH"=���@���������C��[Z��*Z�6k���mV�|���;U���������83��^2���k_���k�> ?��r�z��c����=`��t
-eF���W���!�T����x��2�*���ix�u�M����r`�T�^�gbS������qJS�*��eP�������2y�����<i����IFI��9~�Vr|.NB�a�ze���+�R������7�2����mT&Hb�0���{,�nh������P�S��bf����w�lY��9P5!����W�'����P9F��E�(7B#��E�;��t+����|�L.�BM��yMR��]O��xhB�@v��_J�:k��A^�]��[1 ���SM�B���'�e.w������6J����H�oTV�H3Wwf�Ih�U����l:�U��~nqM���������.�.W+��
���<�fR�u����ub�����D��J�=����m�F��Uo�����
M>�d1��]9�
6������P�l�(r�p"����h/w����7����F�!J��[�fu)��siO�g&�u�2�c������B���.��`�F�h���F�W��1���3�2�h�Z��5����A��"��-�:sol���u�df���#}��P�
�f������#v�=)�a��k��_�6�8�@��>rX� ��~�\��j+����1?��hi���;�qa�C��W��ja��n
fXJ��r]I'���{����K�0���ZJ��#�I�2}e�'���q�C�g����:5F��
��=]����>u�?�'�A��(���:d���P���{XX]��?,����s�DYc��cY����iw�`5T_E7PH�s�w��Eu��<�|
���=�����v�r����&o.�'|V�!��:9��i�9�bjA�]!���@��4)����	a�$������D[�ui��!��6����y��@8(��JJ���T�d�TUi~s�k.��{���%�>���
�_�nw����'�[�����m1�#J$����B�o?%�yx���-�`�|���k��i},RM�����p�u��_r��3{b	���'�=�x8F��"���\���tF��y�V
j�����K�����L��@78�����-���np����}��
����o��u���"�\�~��������;N�����M��
�����;>��������t��}�������<`$�����������{<�
u_�Ij��<�4s�)4RA:3o7u|�H�y)�"�@��'��W��.
E�g��:�z���9���*:����T�:GZN�?vT���tj��3������Q�*B�!��|4��r�"(J�����|��7x�+�g�v�Lc@��p��!H�}�v>�u�b��1g���i��Z�c>�<Z�t�C���P�74�K�|����\u]��:O��Di%�;����vo=0�	V�+>���0���UqJ&/\��?�4������/)����3=_������=�V~��}����t�����&w����>�g�G@#�}����v}g���I�9MS��u������I�7D7?-hd���2G�^��_����~-O���>�Z�L�&�{IvS�Qda���F����.�w|r}^��/pd���K��R�H��8
���	l��Lb�^���P�%q�R���.c��g�J���_q��!�k����x��n�Pi�[�LU�X���K4�m��|6��f��^Y��'��_���gON��HJ���������m��~R��8����� aK"�y{��L�}�7��GqA���e0��`^w4����s����_�7S�W2Q*-��/W<^?4n[�cxu|HY
�B�y�o�,��vo���</�s�S�E��������3��������rI����b��������pzsPx�d����>h�?��b�lnlr�=�U�Y2.y�����9�Z���[6�{�6M�;{�^N��V`���z������s�����XsA���a�����5��O@gG���g�����'�{�����r2K�%����!~��}R�^"��\U#����WM�����M{�F���S��p��n�3g?�tu�p~�������<>Vuw��^|}�vH�}��t�����p���p�����1�������->��h�9]j���?	]:�YM=��o�<o������5\/�����a���7�����~��_o<�NV��?��,�������������3:���K
�~4�����-
�Z�������Ng@��xwp�?����',���q�����=���;�[\h�%����n�.��[x��$���YC�k�S�c����m�b�{�qt�f&�}��;K�X~��������:
��xI�z�{v�`�|���y��5�sT��r�G���_sQ}Q���������������	�\�0�m3jz�\z>�{\��i�{�-J"p�CV�����L��T�nO7�����������f�}�K7�t���N��G��-)S���k��j���#������)����s�
�J�7�]:=���1����5*>���l����]/4��������CM���gC�5n�_������#�h'G�����s�@y�[�yO#��)X��(�����������Fb��:�����|7�9L��4�rOyE.|Y<��0Mv?���i�0��v���y��&w;_�;W����r�>x�5V�����O]������S�6@����w4A�U6N�c��C|����5�+X}q^C�	�y�g	v�u>9�[
�z�V?���wy6�Tz����-.���W��+��bd�Q5��sC�������&�$l�!uc�$sL�$����9\5�;]�<U���G��n�ER|��;t�����W�S�����{��5+�|����5��|E�@��T7g9�p;���jD!H`w�>q����������b������7�����Yst���Mw2����|5^3�����������p5�+7aI�������8��x>d�44���N��Ln�N'��R�k=�5w@�����n��/1��M���OZ�N<y�=Ke�lO����k�36�
����e�|j�
�.�w��k�s`����V�YEE~���%��V������!�T���uj*x���e(����rqco*�as���D>9�{=8�����7�.��w�{D�J@��j���D��{���C1kC����~0���uc2Kn��RC��;h7<�~g�2s�������<�Z�`JsU	�O�o�m���Q������������VAu��7T�~�S����0Q��~�S��X�����B������
�Fa������-��|�)Aa_����U���������9~�S��"�b~|2�}�X��7K��.�����A�n6��b��A��g�����S�����7}�
���+,7!���r�nM2g������3M�`��QQx�
����.�������r�k����!�����AA��"�fD3��Qm����)����s�h�x:�=�k�)-[2��T���O�����gO���0F��R�^d10�c�t�B2' �)���������������57�������������u��/r��}��N��������Z���+�F�Jf��{��J��
��]�|����{� ��'nw�����
M7+g��G)���������������"^���\���W�wv��X�����8�a���u�\����t��7�'����Z+��o�tL�[��Y����W�����7��W{��f�ZZ�#;D�"����v������3s#gj�E�������o��"���~bNqHS�q�R��x�M����������������T��D�{���"�f�0���z��!���t���Qq���uV������i��e*	�*�V��w���
�"�z�X�p�����yl�T��84�v�Z:d	~��Z����|�	���:�G9���C����k������0������H�����\>�4o
M��b�c-S�5��
�����<�5��<;�j�6�C�g��x�g������'���;I@)�������v��K�1��p��o�����L��A3��#X��l�����Y#.��%Ej�|�����N}��KY
�������!��������zp:�8t�|=1����w�������]r�M������0��0?�9M~_��T}A���0X��
y��CB2�".6"P��������E\��J��4��"��v�.b��L	*E����T�~NJ�4oi�g[O6�BIk6�C*���Q*��T����KR\]1��P���)���������d������/R�����rb=�#%U9����F����%]��r;J������X~�S'#,W+��;�;�1	K�����q�5���@��=y����C7K�	�<��n�s������Wb��r��Km��7�N��!����P�J�F�[����S���*s���]�.]���#���n5�4��}�'�'�~������n�}hY<��:�����'ES��i��>�d��T�D%�q��V�����U�ZWk��7fF��������_��[�c @c<��|�S�0�������������k���$-�z8��Z��[A��zZ{�^�<R�����6����rd9�
��.��k�e;����*�b�����yDE��-��*���P�_l_��2L{3L���B��"O���\�n��9���0CFi=����Fs�����4����]~�> �������p�Y�����
�N�]I�����piO���GH��K����������!�XX&w�P�=I���Sd�b��6v��o������y�������\���(gUA����V�����0��D���zyF��]� �:b����a���k<�Wx6���a��xj�j=~j=8?�,I��g�5�x6��=jE �z�}�Z��i>Z�����F�R�*���nj��},2b������*<��3\��~��������r��xf"���>	��	^b��f��Za���^�����=���4x���p�f��B���U�M�,���A�����wP�Se��G����"�r�c#����������2���#�"c���K����HC���.�n+��P�wu	�o)"���!����3�Zz��<����{�q��=����������;�����h�mo�m��g�.p"n%��#Y����9|v�8����Kc����a��4���GK���N��C�o�^�d�O={�
']�s0�m�Yk��������D(�����OE���{��y���)NtJ��P�~�����F#�}�ri���}�U�E���EU4C#���E����n���Z������Hu?�-�L���������O3bh-'!���I=JU8Z����Q�����M~4#T.��q��������
K"�.qUg�vB�����c��-�����w�i?�i�I�|�����w�YY{u���t��c��F�&Se~�b�6hG6�x�����b����b:2<�_��Wo��4�+�I?���3�{��v_d��q���y�����L�c����^K��A�lC��x����R&N�:�.=���Y	���Z�������2q���[�i�q�]
j[�{�/	�������p8�E��F�����4��up��!�TOL���;�)�2[O@�A#qQ}������R1��f]�Jn���n��jn��U���3+���cv�Z��CM��6���0����<?�0�R-~��E�X��X��h�!�8�����=�)B�p[!�
�����OF�8�I'�������'�a�QQ+JCJC:�~����+,��<'/<�������8�����U42�M���@�?)���m��X�N][��cP��`���+&�W��(/&�-Z{?��zWT��Qp'�4�H�wE��zgj��
�?�����a� @H<��/^A��C�^}��pf8Qx$���m���K���6�}��2����l�w���6�E���J����F��p>��E��O�������s�[��W��P������&��
Et0BA�,]���.�lq�����7Q5B\~�o\��X��(S������uq
��p��A����HJ/|RT�����R�_R���o��A`+��o�����[�h�zE9����T��%|�m���^�t�����,������R�upe��f��tY��g���[b��o	"��:D&�
�NQQ��j}�bc� �jC����;�UL�_���y�/�D�������*y�pn%�V:h��TO6D�{���UG`9�w�HS�����N�K�wI��,{v`jS��!=!�J&R��E��s�3�I'4�����*]�g���!s���-=�Zj�*]��=m����3��G�8������k���u�Fd������m��V�T+�#��[B���w��pr�1�Z�L|r�T�!w�P���P����;��s�����N6�#�U�oZ=�D��c��}?�o]a�/HK�z`�v��C��_����M�E� �S���4����h�J���U������������24[��$�c�_V�������W7�4�9s/���W(U����R��Gw��C�X�&����=Ck���[4��Yts�7�L+�����.X4��S9N��Z���`���;����E��_D��(����.�R#j��"�2ZP�Z.���.��@����N��:����%��n��5��
�!��1?�����i�l���8���������=(����J[�y��B���#�,%���\��QZ�k��I/�EJ���Vu
]����y���I)�zO,�P��(��	�QWP�0w��8�K�M���Xyu*�)Gd��^S3g`/(~���kM���z.�/�Z'TK�����vO���G�
����nq��s�B	���)������4��,k�%��jRmJ�������p����Rz�~}�k��:����B{�Wq2���c�B�`��Q`��^6c`B��z����9�|�;8�.�X��p�O����$�I��a��"�����,Is�v�����PL�P��h����g�Z*��h>�?~�v��7��a��E�[�F���DJ#_���>�W��������������AS�S6��qw���(�Z��#��c�K��y����������4��4?E� 2@	�d*��v�/b>=��e*���<�]cE�"Vl�kbO~��BV�����%
�%M�G�}��O�����q������>��"{���v�����%7��6��`��n�������Tw�o����xG;5��u���{�����9���|����+���*�����if?Vy��M���,��u�w����6A����>Am�#C����j�Tx��ycO~#�X�����l�Re}X�[�h�/o�8h���	�!�������)h=R��g�m�"	`'��'�������4-�4����f�/�8�\�?qk�ss>��2T���������l�T�II54to�;��U�"���������&��m�����.���ko��^�$q^�����l3~�H�o�5��i���E&�fM�e����.ha]�k;����*��MpX������\y�����oI(�C�h9H�&��"H����2�������
M�B���,��B�Q�c/&	��W\�M&Ido���\��C�Z�F�j�HGN�r5K3�H�P�p~Y��7��M��v�:�(�:>u��W �Gy�����^�8d��lFi��<���*���U���f���I%~�����,#KEZ
���9*��0�#Z��r[�[*�}��q���va������MP�nJ#���(FP��x���V~�}q�7|���v��n:�����l4���N5N:��5�3,�
���6:�G�����)~���J��AY��=#������)����(�5�%�k����mz��Wi%T�������kHP�����`�
�������uqw�q,����4���W?�����f�}�&��Z���zw�F��z:�������SF����S�������(�,��7������00����n���t�jO7fC�����r��BM.�=A($���u{A��_�h�����{C������5�!���-�_�<{!����mm������
�����-T�5lh�Vo�]����B�$�i��&��������,;~{e5����5��Fn�&m����t�^�Y��o��m(6�G����>�j�']I��<�_����q1a��p����:O�{fdk��n����nc��}��6d�:�w�4���$��KE���;D��=��S���m
�@T�a_��(������v�Gd#J�!��y
UIny�c�������w��J8j�i_�i_ �������i��$�6��\z�H*\�������En�(���J�^�?��Dp2
(��!��Su����h�n�bPrJ3�1W����}���������I�q����S&19�?�i���N>�x7�F�0�_����o���w��$<���MR��u����<�<��������`
G}>yF ��a��?���OaR�m`F�V�7������D�T6A��L}y7�����R;��=�1����~�=������Ik������o�q�M��M�1�"�V0v�'k'[5�)k A��Q�F��f,��*����S�FaW��*�L�Fs�L��M�<��Il�h���Z&m; �3�:u�;�f��b��Q^KF����������<���/�d�����b>2L:��<�t[���h���rqZ��7_@e1��L�����'�0/�#cE}��t�"�A��$
��l;���F��y��5�����;@��-�J���
�>����3��
�������~s�^Z��q�W:��S@�yVBl��w]���5W���������������vGM���D,�cuZ�=r��������h�+�]j���%n��:��X��p�`���X��9Oa������b�.dK@6*�����_%s(<7�KwxT�H+
�� ����`����;1Z�k/yQ�/�(�����������~s�/�
�*��J�fr(9\���Q�x_6E{i.��������8����N��Z�k��)>L�A|NI��G��r2������������"�����R���"- ���� H�i���K��A@DJD���v���9����s�������Y�����[�=�!7�����.���7W���z���-H�%����@X�gd�0��1hkf�{�u=����������z��V��_�Wx���kg��/�M�4"�{��I���|n�{��d�&_�M�XqJ��M�E������,�	AN�t����o�k�X�#8����SZ���k�t]zQ��vf@|�dz��;������l����B+gs_����H��%���a���ni�8���q�����7�]z1��-�f���V������;���K��|�.���ZB6�p�U�O�g����/��|<�dz{.c-��	�(�7h?�L�����J.�<����NS�����u�M��|�X;n�������a$��V�X�����
�[X-y~6�p�o��w���������\Nw4��>�E�$6C�q=��T6���{��v���,	l��8\oYQ����������orA�0z���(0�����7�n,�H/��0����ki���Q�����=��~O�!���A���j\�"�bl}K��.
�`\�8��+9-�h����=QFE�~������Z����o���06��s����8"~����D��f>)oS�!���D��l��]Y|�2\���b�N�ZTk��0�#���2��Q����wh���I�\�!G;��gK�H�mM0�91�����5�e�1g��"�~�4����G�}��*�'��*����o���^J{����'�Z����$�=�]��e$���$���q��Y�����iK���]fL}F���y%w�����(H5���rTj�%��}`�&�O.����J��#R�l��`���9�L�����e���'����q�}u:=X���X�Whu�����R#�����5 �<���}7�_��O:9h1������,w�Bv"�)��!���&���k����?J������[#D��S6��������/N���_	��w1�@~Q\=c���'7�)�����_���l��{�N�g���dh_(�������Z�[??��Re�_A��C����<��St�g>��y������k��������������"���
��Q}�$���
����������������������X
R=|0>O\��(.���_n#'������*'!=���
��<�p �
EO��BrY��,-����1�}���Y^Ni��ITV�d���T�q������i,���IAs��0�������zx���&�dUK�(�v�������#Zf���A�W^�~R���3���N���d�6����K�R�����r�\s�T�	9��/E�����8�|�S)b#�����/~���u���_��:m�1�GF���k}t�w�4�v���z7��LQ�����L�,w�1���K(��?���/�����/C����bVv)o���:n���Q?���8��h:�V�/��<|�f�	fq"2L��
/W?WZk"H�j���zgVV������'�����,�A������y""��p����_��)Q���a
����N;3���������'G%G����g��_UN-�:���(q�$r�1������OY&QF������V���]g��d q�������p��m�o@�%1��d���n
,4#e�,E[
���J��B���,�������2�>���(We!NT{����y��+�zv%8U�F@��f���Qy�����?��DfbG��)gN�<��csd�������QT)�����a����:��[��ky�U��2^t���N*�����x'��\#��3�Bnd����wLU�q���$��������t]����{������7��G�����,��}:�A��������q��<��.��Bc���	�u���0�P/��|���l����9*�n�>b���iL�C�MJ�����KMqj�������#��fZ���C��|�Ej��ydm�I��^������b���WVf�/���R*u����������8��M�-��z[ugB�V�V\>������,��{b����k��(��������Bf
���?��F����I,�v�O/�^�������:#A�L#����� B��C������b"6��H�]�P?�����CO*�mY��[�6(��c����:��8<��g`�/�DC���~��WzW��k,���8�`�\<�H-;:�	�����s� ��*��j��{�������F�f�|��p���(ov-�k0�#_���������0�����N|�TM6���z�oA�(�
]}����
��?>�~���f��X��.����-!gXfaX�������XC�o��>��~W�I�/��������_W\��.B��'��7��g�}�
��������T��n���L�[�(E|H>�?�8Y�I`�����&��������l����\.���@Tfm_w9{�������xe�6,�O�k9ql����:��{Pt{c�nx)x�)Tn��}p�]��9I�VU*�d\��#��0v�	
w����D����tH�:u�R9�e$C�����g������'
��ImV���A;�k��-e�{I^�"����vL�/���O���a�o���pv
��8������O�7�R�N�?w���q�C!����Y�xN������1f�:��<�
5ELf-I�������� ie��)���YF�����D���PcY�*��wT.���i�5e�\�-��4�u�=�~���$&���-(����O�y�I����)p[���H��
G����UD#�r�|�Cr��rP�;�|u�R?��R/�H,�'-L�%�E���%�!%M�K�SS��������7
������{��l��.�>�,:�,�r�p��(������F��F�
����}'�Z���0f��n�>W/��%�cKc[u#�.�~	�<�D�n3Y��	s�5�8Q�e	;@M���)@��v��6PY�3���H5��+�����z?^8�=�����I3��Hg^����-<��?r��:]��t/��3��0��������=�(���\nn2�8�d�0S'xFh8�EW�f��<A��(d7##�V���'�[F�q�3g�,[��U�/�)�_�]�7�`*2�k��:��j��"%��u�z��C�<!�y��B��O����0��O�!8���j�%�`G����S�,8Ty{j�n�J�������B��d���Q8D&8�H�A7����$�����}�o�2WvT�DS���z�1�Xj��4����Gqx�px�'�gX%4g�kU�o:u+�ex�\;�'�����-��_!���L�$�����z��:m9��qN��;S����|�������H\Q�u�����G_-v��2T�Jx�*_�N�\5�V���j����8�+�_i�9���F`KvQ_9m�������3�]�u%NI��s��Z�|_D�L�v�w�KSv��^���Q��"<�D]�Pa�
��G�k���Sq>�>S���!OG�%�k����d����f�K�/��q����y���I�x�yC��Q�4���T����Z�(������t���y�];|�������[=�������
7jG�A�X�>1��ps��q7a����e���uy����j��j��4�r��y}�P��$�v���[�{�l��}�Se]�0������B���+��e�%�!�tj^��]����]!e"%�
���*C|��l��/��[�~C�;<��%��g3R%N�j��gj��]���
LL�����i�g���F�yr���O�jIAW��j�YJ�'�p��<]�;Y�X���lZ��L����?�i�~W�����u�gk���1���_*N�x�(c���*�T����~��b�-e��~|q<��D��'$���8+�%��9|f��i����fz���XK6Umj��������Z�4�B���Q\�����w>���A����#��
����(�����RaL!���<Ox���C`��^�H�M���J����{�(M��Dn��v�Z?� ��g������r������X6`NU�*��pb@zU���3��+�=��9R�z
������Sw��q?��W�8a��J���w����v�����7�_ x���HN�r�������^����V{.������e��eI��;T�����)��dUygk!��BkMb��c7`���w�s�e/G��uL��a���gG��(���&y'f��r��7�yY�6���� �?��y=Y�	v�����_m=��X��s'���������$_������:P\�����"��+.�|���0Y��p�@�J�H<�m�+�q�yD\��*mR��]��H|�������{8���h��Z/�*N�E��T||$�����\��v�!���Os�6}d��a�S����jKw��m�Y�)+*=eY�m��?��0���+�j��G�������S��Op��~�+@�����,����Ou3#i��{� ?�������Dl�5�6Y�����NO�������[.1�s�oa�f���2\����*�wD�Z�r11���u���n�>����LMA��t�y�Z��Vng������{�������W����GC�$�j��D���YH�',�/
�rY2X�@a���t-hX�%���F��F�5:�nCR�\z$��gk���/�vW������c�/�
+�!Zn{��~�-z%������\����Pu�����:
�M�1����L�O����"��������~��"��q��5W��o$�s�Q�OF�A'���5���5�J��E��H�@��?��r�I*��b��)����Y�����|o"b��<�����k������8Y�mO���?�Q/eX�X�������K�C]9 A*�j��h���2�+���e��,4��������b2�b���������nQ?�P�5��2����
Ux>z��f��g�8�{N]��%E���`�I������,�rR���W$�HW�Vi,;�7��Z]Kq�|��RG�cI���5rE�A���p�g^�y�
{��ot����V���&�������_w��`�������$D�%S�:
q�j�\�
C~����!<���b��<cgK�\��f��!���|�a�jq�Np��c�.��So���������CXg�8�0����������������$���AjY�d`�T�������=X�&�T�+��y__�����9P�K{x���p�Q/LgK���6�x�ZC� ,��J����4�
"o�V�=�`�N���(4�,A�.q����
[%
�"�S�������
Z��/��h�5�����V�h�m������q���_��Ey�o��yf?��������Z/�~%M��4^��j4Kd�V�;����
���c���~��0���m�I����FD�
vd .�L��8.T�}����+rZ��~BK��X97|���"��G�T������M����b-��15~�����Z���[����b�Z�T�����l\���+�����&�I����z�A�Q
�~�m�6;���4s�q��O�l���`qMF��y��5�sv���r����T����J���N�� �����]��?SET���YQ�xB�6����?4��}"�k?�h�|��i6��KZa��$:��u�j������YR�]EfQ 2��^-W�!)#]A��lu39�
r��_j�7��Zo��t"O� �%����������N�	��U��S���-����0T�&�_������|�K�rX�R,:��[o-�1]�!-�A����V�v���c����
����-��dt��=5�v���=�MGw��v�x�v�v1�S%��H���Ji���je:��M$��K���Z�74���(w7��z�'��$`��H��0^=����#E8A���o����h�����]9C�]9�����������Z���p'�����������.�������}������Us�q.
rg�!B	lPSx��8����H�p��L,r)��-q,��K+�Y�0p��W��
��������k��������N����=�=n���#������h�����!���	��������Z�UV��d,�����z-���\K��7���C0�n��d:��L%V�P���������
q1w.'3��������{�A�2	:Bx-�:���H�=P�3�z��-�5�������t��P�?�S-���%��.�	�gD�ty�z�X�{6%���]Ya��]�[
��[��~�c�>���QZ�F��c�}������d��tj:*��>����CZj�#������m4�Ru�=cv�����,#�����vF�����/��e�&���c\����nx�e-	S�dN�����&�����^{%QKb���UH�0r��-��9���cR�����0���q:NgjE�����|����S�b�R�F�
L�`�R���5�+�~��Ga+`�ass���m�+7�&�Wl�����UC����P\-��<f�i.����E=n<X������!]����w�#9�$�����$�[����`� �������zi������SZj��V+<S����z�uM^�3	�Y3��O8�~�,����j�e��Z�D�D�mg�@1�����YA`3R8V�}����k�4h~���kxNo��DK0e<e�7-w�?����t9Y�J�������j;�_�v>f��m���~O>Yw�!����C�v�v2�9��o��w���9���F?1���H�S�2���n���	I^>�5�L}��Ro��������S
�0��:}nV��9��Qyyn��T�1��zQ�.%���A�v�C��gc
��'�i��K���D3
T�]�A�[Sw&����u����8��~�����+�U�������y6l�'��9��dW?�^*=��s���XJJ�����Z���a5�h��I���r_�4��f���)3K97�v��]����d��_up���L���+�����`��r�Rb������Dc����u���Z�
efi \l����<�y'7F	(�v�BJ�%��n'�+�j���]0hGs����	�,�]A���?@�����8@�<��a��1���d��u����b��:�"{5�mQ�
u��w��.4Ak�v#����3� ����f�*=i`�5�J�f��
*��R��+�n�f�{��N�������pm^���,�,�M��2Od���U�m*����g#br�b0N�v�(��?�w(�[�2�������W�������m��W�g���
��]�,���g�Y�W��o��gmcE��/��J��N���K�FL�%���I�����`���)��,�ot��'���\�UgF���'��,���f-�1A��L���`�6a�5��QZ'I=*'�#R4\%��M���b��
9{)�S��V�|j����&�=A������6�#�����*�����*������g�#��b28���^����)��`<��C���L;]����T1���m��b*l�)N�f�_�5��6�Q�Xz+X/�<������$�d�'?uLPb����Ce�9l�L��},����:G�n�[���t��$Z�e��$����e
����9�(�}[�+��Rzj���J�bmj�RrE���{��+:e��~��Yfi,�Bj��J�3��e�W�����O�R�9�>.-�����]E����q�0�W��1��>�_DpF��I���V	Dm&>�n��C�A��i�8�!�k�_���`�
��N�����*]�N;G��uh-q���y���!V��\�}���)hH������ihio��D���X]��mX.�����s���~�\|JQ�8����r����Zo&]-��VM����%V�d2�
y�a�LN�G���p�!�2�V���1��-c���
�C�Y�_U�\�zW:��)	Y���X��q����� ��X!}�)�Y�[�W��'?�3K���5�@*=��mJu�eA�2iG)9������cK�?p����e�����M�.=�D�������wgI�LQ��b��}�T/���[�0^���(P���rQ�����O��
���!
Y�OQ����y�W�H�U�kQ���pM��8�����������Nx�����D�@�lZr0�>��@�4��>W�[�Q���gG�S�T���F�rw
U�s+��U��<��_v\*~s��8�
�si�m~��5\b�W\yj�_���WG�c�
����`��CU�����}19����:���Md�������$�);"��9".Ky7��\�-N�Y1�R��^�9������<2��-4-��i��G2
���kOA[oM�
���k�H"�(���f�uwT\ ��>���E k�*"�2������E��;-1�!���>������s���Zkx��_��y1@M|Y�9�5�J�P���>��R�/q�b�W���Z��Z ��0���}e����7u�$�^���s4F�[��G��#�Rp+��~�����@�a|��v�I���^9�K�w���]��U�$��%((�]�n�6��Y�m�+�v�������YF��l!V(��<�f��g���z���������,�fe����dU��D�A�*_
9�f���y�d�R�$������
,��#w���kE�����U�xBTS��'!��h��k�����)��K>����k��!!��R������}��<y��cV�6�lD:h(O�������+<������h�XgW�s�FW��	�F��Z�v�TA5�sp�Nx�����9��8��9}��o�3S+:��{�	��~`�������	�,�����V�����j3���bij�Tq�����K�����g_���2��4�C�"Y����m��8yS�9��f['��*�3��������=�m�4.��k~��������<�>�(��5�$�)��jO�K4;��(�*�+��"�jG� >���M�=�7T���*Q�����Pj���|cM�K4y!��	T�F���,?v��
�<��8�7�I��X}A^|�G��q��8n�jV�{�$��S>�2s�@�~��%�&
$�;Gt���iU�#8�����{����17���c]��X�K����TTG�>�J������K�H�4<r��VI�����H��r�l����������r=�-ah.��f��Ru<���6�?l��U�9���J��$]8�8^?��O�������j�h�����=5��B�J��T�#����j &mAd�����;>E��C������`^��U���*N�K3E�.�����/��>YKK��X=7��'�1���5�aE�����U
#�G���M��������Z�����_�pO��l �����^�4� �[Vd���������/=YM�uw��S/LAc��(�K��1�.k����CHa��
G���m-���FD<�D��Q�XL6�ZW]�]��%gz��l)��8����^/F��#	x������_N��i�����B��e��A�����t�D��j�{���#n��,�A���W�e�����Xm�C:>��k2�������5���t��+�������*���5��*�1e�*�c_Qw�c�9c�4pb`�L&�c���%_��(�5�����#/S��(��: }��1�@X�O$����P��^�������G��5V������8������c�
pjm��l<w6N6b�`�O�J9w']��*nQ����1[0Y�C@����������3��G.=���L�5������i32�v����@\k������3S�p4���y���y�F��G�l��R����wU������{�t���>�����x���l�sp�I@2��:���+�d�v0���vHru������/1���K��8�����Hc8�������Pjx����i��G0���CxE���E�[e~��o��7���s��4��������?wH�;�:s�����.|�<��R	�����sk����2��T��UG�9Kkp�j�&���2O�d� �Fy��h�_���R��pM2��a[������1\�`w�d��j�j�0��9�z*�lA���)�)��?p`KDR~��<���=��9������h�=&T�H��[�cE�&���!��� �	�Fi)��H�s�47r�a+�<.�{�����q�[�e�V��Bz5Y�����6�4�u��`p�7h>Pv��f_ )�E���?����,�eP�� ��m��"��JNf�m������(�=h�R&�.�HwS��@��G;"�f�y_���[�0x=^�p����w���8z��/�h�oWy@l�j�`~"����\tv�����`�]����q��@n/d � ������l0'�F@���aad�/�Lz}��g[��?��Rg��R�C7i�V���6w���#�)'��������Z��J��@�/rX�R	�����}i4y�n/#��k/����Jy+�+����R��b�5"��<���;��v��\r&j����z��C
L���B�c�iW�^D:�u[O�d����)����uu�f����qP�b?�wq�Z;���
��7e��;������!��r��\�a�Yd����l�F�+#�g�����@N�r$l��m�	-N�&��#�V�%�{$l�	�h�?Jz����m���G���G��^��Z�*�AM�oi�d$a�.�g��*��$��e=SsJ�2!a��3\��y����\��h��l�O��F�w3��f�Sg�Jx�O�,�c��
�����5��eKn��!VF,���c"�b�e:~�s���XV��������Sw�%~��:�������}�����\O���=
�a�d�������Yf��9�2O�f7
c���R�����.��0���������A��������U�����XS�_V��c�6�����	L������l�����b�������D��
���
{3i�����W9>���p20I^�����"S1<%�^��i`R��N�,PA��7G����!��2����&~���._A7�]�x��3���R��������g�,$:K?�I�Y��&��)-����F��7>M��������Oc!s�������{�(�CC�w��Zx�q������T`�`�����]��-���\_En�|{s�	!!��(�Qk�D8
f(��Y��@�����l�1 ���v�=v
���r=3�<������w�g��*����MN�5I����*
1������9=�]�(s���j�|�<���
���2�����,�T������d<Qj���p�.A!1����9�A�`��,P���Rx6K����������Fi�����f�o[;�����&^�����=���y
������S�|(	^�D���\�S�y�M:����
�"��Yo�z�zu��g�S4B����X����&m|��-������7���������F���K��D��U�8�����{'1��}2�2�%Q<����Ac��+Ww����m�����h5n�����0�]�� ���v��gv��|:�B�7I�1H���b�6��Q

t�`�m/+��#��l1���wV�`�C_�K��2����� q���<�v����Id���$���Vl�mC"���*+m�B�wN�w��b�x`����3�YtrqV�o����J��Ee����?�]
{�m���Z�Q��e%�Z��
}�Q���}����~�E�8��g�����Pr�*����Q���'���	 ��7��;KR�������XL���bl��W!?�i�����k/����U�q��t�Q��'�u���7x)��Dy�QV���������r������Y�>�����xs/�x�����#D��~��"C��������1Q(I�G�&>�'�����I�����Z �^V�G��8QU/k�f�S��VW�n������#���>��Ak�y��K��Q��=���hN��aY�����d�z?Z�=TE��/�L�s��>�s�����1�j�s��Q������q���VRKe�������'��m���p�����R��^=}^��$���-��7H�uc�u���
��y�@U�S���2���]������;�����#���s���o���Y������?�:7����ry-.�3�w����X~�=������#"2����k���4���
��k��y�(��f(B6;[5�5�h�T*��p5�[�@��s4�����:"�q���}`1���(]���������N��k#[���L�KX�c������T����i��n���d����$n�	y��L
5�\�k���c7��R�����E},�\Y���w���8��R���/�T��>�9D:��8M�����*���
����;@J~S�ws��6I����;u�����e7	����+x�x0���|�,�+��M��j_U���im������:�Vx=�S��Q4�� ��5����W�6�����})����5�h6
GW���^�vY/G��b��/}�D^J�H��7;"����X�����xq���^}U���|4�9��A`<�:�{M���;p�[Y�#<*)�F_�����VRpQ�@��6�]�G�uC�XYh������5��X�%U�0j�d$�
��,jxH9h���Sh��s��yF�P����M���1����z.�itg��x�o����]Du����S���C���������	���� ���cG]��w@j����x�J�	���������	�/���v��,����*�} seW��_�: �
�;]���5����,�G��R���"=��ragEN�����k^��0��6��C�'�I��of�/ad�=����?��$����������{Xbdr��O	^%��@�/���I��N����R��@�}:x���y<����X�Z~��R�o�n��6o$z�[m0�D#��M�=��m���/�_S:����|�������kj��
��.��c!W�?�q��nm]a���%N�#��Q�}j������e�9��4��7����>{���E�Qr��+/,����P��^\\XY��8SVF~eY��?�X��������	V�f������;�X���q�=�xa?'l"5����� ��M��1�uv)!��g�������=+=�@m�n5�h7��j3��������>�[��1n�B���>��)+�	�2O-�,��S�foI����*�J����W[���J��������k^�^`���O���A��l��/O���[r[����J`O0'1X��������jz�
f*��P�������[�	����U#���F��w��83�>���\��>�A� ��e��^���|aF��AM#-�
/_Q
W��*�N��(N����rw�^f�t��Wf��_�x��T�X����Q4
�=d�{Ac�S�<�*>���<��6�W/_*�h���]5���)��/2LCV�������DR������,��hz��vP$b�f.��
�Q�� ��2��~�cY����m4[DR�"to/�O	<m���Q���6E� �����z����U���|����#����v�� ��?�P�R��MBD{�
��R�$XH�R�~�`�g�\D����j�������NCW��I�b����r��[����Lg��v���_S/�s�cJ��v#�]����M��TA��`+n?�m~<��d��t�4tq�[!eSB�W�{`��;���dk;
��
����u-�02�W����Ef�n���j��)��Z6	��:qk�3�n{*��������2�d	l("��'@��[�m�b&���m6�����.�����0�
c2���2�a��o_)�����n�X��vs�hC�0:��D���
6�E��^viVV[*�I���P�����@�/���{���^V�K`�"t��jp\��M>���#?[U34�0=G>l����9�@��^N��pQ.�Q������ZY�-�K�0p��K=�Zy#��i
�&�h ������.�8�?;�;Ow�!��-��R0Rr�7t��I�=h�9�����b��v����0���������B�f��zm1?\�&�>jP����������^���1	�"������)m���]��4�M��~�2���"%��Z��5I���~���U�u�,�qS�,�g"�[4�mI����8���O!�GR=�Z��p,z��`�F9�y�&��P���z�2�!��RI����Ay6�N��i�����������L�L�ZEK�\�.B}���:������_�ss�z�Mu���Vdf(��0�e���<e�_F�K>�����u[�� �����y{�����s��w��F�D�rsZA���s��oX�	5�~������5�u������q�;�w>-�<�~����nm����� ��1�s��0���FD:����^��-�p�����q��8�Z�_���(���?U�=����<�U+c5����!����;UD+�:>�n�\�gv��z-��&���������nh��%JD�3w�L����f}��N���������I����V���j���(��)D�R���F�|	X<�72���.��(�b"�pK��R���!j�p��H7��N���zh�qi���;�F�T���m����=�]k��k�w���s?}0��������+��t<�B�\'�u���!��ySo[Y��S������������ �%�;�)^bV�q���v���Ht����%b2�.m�+���v\�x�������gRNIjW��-��i���!7S7/�����x��R�h��:��F�b��x�b=A���T��C�&�K��sU�S}���o����t�i��v��9��������/0�=����l83m����o��.�J������|R��K�1��
�zG��!e����JW���L�5��O��I� �S��@d�g���Y%��~F����/b��
���1X��j�!� ���H����BG2+!������n�C:P��m����q����Y����$���3�;��E����Ru��<���,����v�Cz<t�>	�{t��[*��[8��kk�1�������?w���`@o1�h�C�E������9w�D*'�^^���7�:V���5�����t@h@�����f��Il�UB*�+M�"?�Z��R 0^5��j��w�.qb��l>�Xfs�i2T:�(7��kS��9yv6��o:�iA�����w���o��7B��\��5�<D	��W��*r`
E��ODAU%QU�FU�e�k$WE��?��bH���%��_������+����&�����]��e��'��b�`�$�Z��*T�]R���]�Fa�I�"�����,b%��Z��8l�<�������F������2����* ���}��T</�W�jL�S�����r�I���F��p��]
���-
goh���I�32��N[?��f��m�%Yh���^W��'D����F�d�����W����3����E���#��Y������h���I�}7%gy�k����L�Ux�|�$�J�����~]��^�U^�g�S��6a�L���,�_*��������#�)"�����
p�;�yu3ov���AKK��D���)�^���A�3n����Y�`�b�C��w�����������U}Y9�T�	���`;��Wo���qP1W����(��i�I�YmP*#P�h����`q�e�������*������
��e�!��	cN�.���~�h�X8G�q����pP.����I~	��]���:�8�}���H���-1��m���YZF��i��*M%Pg��!g���dq�����q����v��u�9����DD�e����U����J�h���m��W�����`���\�_�o���V����\u��/N#�+.�oqjSQ��l�o�
�bI����z�J��@�8G��NG����/TD�'��C�����<�N��u���k�Hw8���������=k��wY�M�=�u�=��4(c:W�W���&��!�������;>��v����q�B���Hl���g��"g�9;{C�&��9\K��v����M.(�d��jw��])"����WhkAu�����^@����[��v���_Nr�C�	��C����I������q�%=�*U��~2���������������m��)���s�D�Ix��+����������5����	9���P�O��$�Q��[�o�)Na&�&�w)��"�D�C���'�r�%��l�Z��x]�-"������x�.�	^��������x��.�����(U+�~����
���<�MV�Au��-��:0aL�t��>�m�F=r��03^g�����L�(�v�6�O��11>���p��:�*�Xrq��������D���3��*�"b��Dx�0�gl�!���15F�,Vd�������Nd����?`��\����R$��!L*KM,������E�f�HD1���n�_���:dxs�[s��z�W�D�C7q����z��$�<�y�8����kn��k�E��������������&���9�N��M��3�K�U��/��*t�W����a�Y�
���0>�������d�����W�G����*P��
>�E�t��gb����P��#��������a�1n`%�_��v\����3�7�����L�_b��d������EN(����J����aHs.aHC��{��N�2��"x ���#���Pt���9	�c��4s��#f[�$Y������o�e1�{,]AV:���]a8Bk��n�/����OS��`2�����c�QM�;E@���@�x�"c�b��m"5�*z�m
���!�Y�7�>��P��a/�xQ���YC�Fo���a�?W"��G��L5��!���o�,t�������]Y����O?�GmQ����(E��N��H~���s���(���Lpo(58O.&m�k�\-)�2�g��}���Y���35[.0�l�,�`~Q��u4���(������k��uk�q;�C�Tq[i�{U�@�������w��n�a�^SXP����i����pPv�+��]��;fj@��@ub�����������	:���]�����=�_c�(����������>����A��cN�����U������'�uI��P�0��V�$j�����lc����{4S�6Qw��b]��%�������9��+���$�D	��w���� ��AZ���Q(�w<1Y���<y����)���P�������:��@�s�{��'+e��A�#46�>����������o�~R1P�}��X�$[mq��&�
]���-�%�C������A���o�B��Uk��WR�'��t�@&�����%�HP���46k�7&*��������S�@�w��x��$C_�1R�M7��#?Ve6�~
���+�������e,\����6��
��/�go���N�:�s^��	mW���}�-��/����>�\���=��������?�~A*�����O��Z�xiox}�w�s�t��%1[�r����PDP&�7����l��������Q�@�ca��Z���QAY�.�D^RTK��^���
���Ac&���cT=�##<�,-�(�*������09l
���N���������@c�3V=�{ls��\z!�{��=������	5��L���``r�Y��`��#I�]��~d�������B��Y`�@w�Qi��K~a��w���:�EF���*RS���u��&�����N�������b���G��+�-�A�!p��	��CDV�<�Tbp����=���|�V_V����QU7I�m�)s�X��x��Xgg�\$�M�6J�����H0LeW���S�-L��Vt�Zt�/��
=o�a���H��)�9h\v��r����Lz����-MA���;���*��/������y�I�@,.�d�/D�4P�(}�=�c���a���#
�?�A��������W�X??�g����1����#�_�x fo����1���@LP�!�v*���BE_Rz��y���^��-�JU�����+[���J�����/-�ZW3�
'k� H�w�)�����&�f��xK���GJ�����|�in?r�x���#�������&�Pan����arz�����=�]�8<��c@�����n���M�>�}�pq�8��Fk�5ke=!gC�z9@|-���S���������B��
%��Se@��18��2M�e��D~%g������i�I0������CB�s
�0�ZB��s��������&`��:�R�	��������7���^@�H�J��A��p�U���*b�$�>8
Ueo=���m�����	v�������{C[���[��M������h�����u���U=����32o��k��_T���L�����QU��?�o��'C���vULzU��/�6����/}��z%~��x�}��78�������o3�i>�t|^���/Y�s-�i'> :��]���K��k��K�������:�<�H#�II��k��~���|��wB����x[A���5��'2X�����!s����_Tv���gkQ@�:[��;���[�����N��d���6�U������Z �������P}��A3�;M�4�@*L<?����T��Y����cM���4w�^C���f��?�}�Y����]���O�Y��=Q��AK��&�����A�+���g����>H�}�)@[���@��m)04��n�1�CaI���m�f4+��r�����]���A��%�[I������0s����u����	X�5-7M]��Il�H��E�j�������,"���9��Xgj^H
c�~E��K�����t�r�R�����C�f����-�	I�HY�}��7��T$L�����x+.(��X����PD���m�b��D��)��+@��yQ����NY��VLz��Y�����x�{���;��	�A �N�z��4[��I��D9'?[��O����g��1��x�0 ��X�$�;�b�C������Xya_��}0{8V3���
���vc�J^���;{�9�)�O�g_��o�p�"8+;v���GH�D�c��Q�:�>a���N)��`����\�[�3YW���������8���w ����d����oj��9�j��#$�������w
�F�����p-gU���S���)�6�8�j�3�N���~�_o�����c���bf�,����������� q���`'��Ud5NV����-TM�6�H}}�� ���nJ��\ ���f<N,���M+�������b�I��D)Mz�e�d�1����>=; NJaR4����}����������u���������l�����H!B��\���!���=�S1/����~}��}���u����/�YV���#���R���-�����P7��`����S��\M�)�X�PS�n&�&����@��@�����'�����q�����v��l�~W	'iLj�$�~!^�/��u�
��)��_����&o$&�:�
�=N������}�{$����0����D��^����-Z��G\��]��?&!�Bd>�`.�0��X�g���T���������-&��]qh�t�5?|��&+���i�H�Z��6g��T.���l#k�u�u��B�C���$t. MxP�!h�X��,�����&���J���~s�iI��:�t�/�[�
�0�	a�[��Q����w��
���r�^?�4x�
��%sjdT��d[�4��v�l���m�*v$���=>=��q������/`y�p���,N���[�	�i���0U�,�F��z���U~�l�w�y�/A`?3~���5#;y���"�CR�����������>���a�EY�t��p�T���d�
t��Q���h*���D��������(�$b6�Mw��{��4&-��"���p�|����?n&?'�H���
���tnB`������q���v���j�n!Y��/,v�Q8�3�����$��������Q��������Do>�������u�����8����������M�?FN���:���-a��/+K3l>��f�p:���_-�qU\�Z�����(����X�����8��� �$44�O����5o�0AZ����;����P&�r���40c�}����"��|�.:p�F�9 �w%!IOC�4i.��b�H�"�{�w��J�3��BV�����@��I����/X���N04� c��G"����x�����*8$y���a��6�F�n�u)��o��B��$0&������{�K3��I�
w'����	}�E���L�ax�rtHE`0���Y&��������8�[H�S������@�=>�sB���V���A�������J��Ky�Ot�,yJ�Or��������>
��q��l�v�@�k�>�w�$B�c����NL�����?���7�A����@u�+���S*��B���|C�*H}����i��M��,"��1MC�}'��_.Bl�Z�k���pg���xH��!��~����e%�a�eo"Ul@�F6��A�9��2�������-`��@}w8t7���rO&2�{J��L���C$�0s=1���o�O�s>B$��|�~1Hj4R8��b�@��SG�����'\h��u�B��Cc~���L[4�Ep{��V��@,;bt/1�^�(��IVf�C�W����>	�ZR�l����O��m$�k��z��{���v�����v�����	��n���.i�&u�So�z>��1dG
}#�#���K�%�������<]��S�����9��0|����`R�[>����R,_{�Yv?���_��7q�:,Z����jP�!����[�;
o"���>�%�9&1:v�)���y�d&�c"�f���HC�����q����AO�+�������maw�J*@��;�/��[��r���t
�LrK}��g���
��Tb/F���_�����$�C]��.�/���2�T���.�=*&���3�~	_@�@�{N28������`���3}���JR�����F+6�_��3�����q��h��5��I~)��Vc�AX�#`.�rH�<1�;�?m&:�z���������K	�L�O���r������� o����-s����1��y0����
m���IR}���B�QB�[a��I$�BT�	h�����	��x���15�t#x�,��U���p(������M�:�{!������0�jB�H�ch��8����\#m�[�>��,��3�K��	-����z�	�]��5�+�~��5��9����md�`��g��v]
Z�mz$���G�;��l�����S�cA�:������\�n��8�K/�E�����<�`���\N�� �OS}'
|�^x0����<����%����Z
>��.���5�����vV�J[�gR&>kV�x=��N�[}}�1+Y{^\�����=���{Oq.:���Q �������i�F���'xH+.���b����-\�iY_�����I�������u�c\���~��|�)��,e�Ft(��x�{���V�!����!���o�4�������]�s�s������PPR
r����||�y���V`���������	��Q�&�?#����'�\^��]�dV�������R�
l"�v�
�,m�E�'�d���A�U|��R����3����H)���TMe������(�y�x��W>tS��{D������mm��/?��( ��������B5��J��q��s��
�B"!hT�ss�})�3��	���j��w�>�6|����Y����,�M���_�g>��+G0�'������)��*v��w�6/�O����N�-�A�`�v����E�(��B�}dn�����+[�=��y)A)lw;i��p�����~��Q��}��7�C����:[��i���x�1�j�B�FG���d4�Yu3�@^�D���Gw����B�4c_��-q���o������H�&��*83s���uq�_�a�)�$R�h��Q��������=)��=(�����CfP]�e����\8��y��0C�M���{���Y���z]�$���>��@��!��l��c�Uu�/��)/�� �/�RWvT���{��km{>�^�����,=�_��f"�����~z���X�{	���4z�x�8��3�{$.1H���}#2�^�Yh� �z�;Sp.���O9|��OqN����rZ?F����`2G�[�x�G��J��t~#�(�'�R,����&����,b�?�-�TbY���Hz�	3s	��������@�	�xK��~z�6(R
!��K�TPy��K>s��9�fZ��I#�9���8��V������V�
7+�����A{2P�R��Q�
�@70�j�)p�T�> �dU���]l��'�nI�
L����AKS%l��m���a�0�R@�����;�T�;��/XZ;K���<mq`��Y���=`�YN��'�>L�i����.��B��V
_t{�����%3�c���[�����u�wEI��5�r"�dJ\�������Lv����3\v��o<^Y����T|�n��!��o2-��Ad�����b�?������t/y�����o35[1����uT9����^���zF����[5�����(DN'�d�C������!2���r�	|��-����?oA�b0��!Txc��9@�����3�������=qV�#T;Y��Y�kK������������v��U����z�"������X�o��ChE�!jA 
���i]����-��W�E�L'��d�����c�xT���P8���Nq8���P�M*���KHxr}'P������m�{��F.��� �
���O��V��.�]���07+ ��;�F��!3�%���K�	`�Z�n��B+)��`��&�)!&dS�=�����p�\xE���I�f{��;�]	������?���P�"��s�n�o�u�+�^�X#�A#V,��{~�w:����2�]J�=��
��]����.M��bz5�T���?/&/����M9>y*�;�(��N(MG#��f��[��V|)Gr��{=!Ob�V-P���.��O����>{R���qQY��w��+����)��\
�
����U����q
b�;�~�X	M:�RAac|�}���Gk%����V{E�y���X����-�f�|j'��w�9��#G���B"��\j��.]
maX�[j��%*Rz�#���C��\������Q:a���0���H��Ab���3p��B���A�-�!���)�����5PS���@`o#�L��d0�n!�w���Y����B��07�����f�F�}��x�}�gBe����(Yq���%N=s�]U��}p�Y	���h_�������"��/�����L0�i8�B\���"o�����K�bY�N'
�L}O |�1���7�=��s���wiy��uI���R�q]�>N`E.9��2-i�����A���%c����}����"����} `����h��\(W8��*����E��C/��d��z]������C��tSL������K�����%�k]�(R���U�k����i�YHS�EGA���>�oE�����0���	��m����b��A�1��BS
m��U('ff�>))NbG����@�U�49,�`'@���
�#�mM���;D�zE����'���V%�����U��`<�#���P���	'�vb��$Rh�q��e��P!�D?6������PZ������/	��t�j	.��?0�7��a�@�@����(�/)��_G������:acP^e�F�h���4�.jb�[��<�`����(���;y.���5��,�P/`��z����`��q�Z~��l�8q���P��OQ��G^.k�����{��
O����
)����JXt���u�K�6V�:�=�	�DX��G� 3������}*�C�������\W(�P�:�z��A]��cY��H��t�&�;`2c��*
���w�U=����G��
��A8 pi��n�1�2p��g��Wn&�����Gt�����\������Z4�(�"�F����K����[_xsf�6�RC�!/�9�=o����c�.s�$�r�W���[��Y�x��\N�������A���(�=�
�v��j���-90�Y0(^����'��A��`d~�j�������%��*�W�r�z��/��l|{��//HVd�o��?���o�v�:��U�@�
�vS`��\(MZr�+�K��,Os)R�j��RK�����M}���.R�j���gdhK:�n�|�t�/��V.Gz�����4/(����lC��������)����e/���2�N^�����A_4	L`�&�`������/���g	_���d�R� 8X���L���y�(��������%��$*^�	j2Y20�b\���dT��s���c�>���u��j��i
�`l1 ���W��6	7c0�D�H�_���=�\�;uNq;J��{M��T;�r����wB\�^��� ]�j^#/0b��������!T�a��0]���}<������A����*b{p5�/P�T�.���/�m?],R��<��t����<!�[\��PC��)��2o��W&���@���P!��m�#������V�Q������T]��k+�%�T���"��Y�=H����z�CQ���K�Y�����"�>���B�����P|;���n8�Y���A,����je.�.�%!�_���&�Xj�#�hW9�#i�?�d8�9���B`������B���|K���M��K�tPb����-���J���-�x�q'�I2�PEy�=����Sd�BTJ$6y�O'�!�u�B��6���|�gG���MDh�g�}q��u�eS�c�����P#��2�}�8��l�&�u+t�k����H[�4�)��lPf��GW�����@�������|���1��"u�N���;��������0!�x�!�����Q9�#�-�rU|y9��lD�$
����}��]w�$bh��������3I��O.]W1���b�*7�1XI�
�_�����RN��8���Uc
W�e����p=�M��T����M&G��r�`zy��&��](S�=X�W���W��Mj4��d����\�s=T���wG��5fc�����"r�,Q�C��c��k~�o���|V��C_t>�
��J���[�p@*Oi)4�e�d�+��,(�`��fS���i�j��)���(����A����y��G��������3��q��}4$�5��[6^���p��_���G���;����x���G���b���(��yE�O O�p�d�n��=�$sh������
���@�^��}=����p�X>j|h�������h���P]G|�%��7+����?x�_e���a�\!^�����J�h\[m�<�>I7f�FW#��F�<��|��� ���G��@��ErE2�P{t�Q��f@�v7�Z(d�����c	�c$��,

�kg�}��j��.0��)a�@�����G���Jw���
F3*��#0��+�t00sl;�6ghX�
������p�}��!dx���9���
T�� �.��|pl�r�������ggW�������������W��^N�����\��2b
g�����5�
����%��������<�Q_�1+��8�/��k�g;�a��7���p&-��|���p6�����L����d�����}�Jm9������'��*���C�x��lw�[������|�����TT�c���L{��Z����������z�s�O�����{�/������w���H�r���,����=|�?�1Dr��N��|�/C��B�90s��l��&�+C�����fLk�����Ir�#��Y��������q�M�����<�����>����k�E*�z-����GM��E��M�t�'ZB>�V+���JB#�V�.s��[�2���:|���~�O2�Y&a�37l��T����0d)h��"�4��.6An;�����vTW�[��4=#�B�*�������-�'�_��Y���(�L�Iu��m|jn{���^����Fr �������@I��X�\�M��T������hoFW�f�$U��A�,�����?�y	������JX/�B�����t�?�xK��n��F��=�N�DE�������c��Q�BW�j��m$���"��mnA;TA�7_�I�^��������E(hjz�dv���p�x���-fC�eEy�������*&R�<���0�����Z�������
;���i�XZ��+������;T��q�q�BP��k)��Y��|���K���o&���b���g���H�����'%X|DdV�wI��<��<��$+B���Y,Z���d��\�Q���de��Vb�X]�����Mx��lc%�� ��`�H��|����~jtF0�T7���,�Y�?���tC��%���UU��}���h�����������V�P�^��<���u��9rQ
$X|��
X�H[�i��������v%�Z�Y��N4�D0]+2���!������h��2������%S�*����\������g�����`�Eu���:�����L��`/v�G��sRCt3�3{�wYp!��i3���Y�V"�OZ�/_t��tu_nT�����1��x[A��&�V������G����R���oQL&{|-�P^�m�)�h+k����r���^��G��f��!����w^�8���=R9
2q�V�1Fu��"�r����'���y�KC��UC��G9kA�h���	�y��	6�`�wQ��@��'Q>O����:��%�H��+��]�L&���@%\����^�q7}�p�a���1V>�&��>T���$����O�"{(f��(c-T�����3,jKw��4������t����et�9`�l�r����cy�S����6��������{�jG�����	f���j��kM�
]��vJ]��+MA�����^���\�N����^�3�@����zk�C����Y�:��>�/h-���+*_�H�,�h��.�!Y�D!k�����s�V��y�O!:_����S���S�U��-�������� ��j���5�
[���.�����i���������d�Ng��&
�c��!�e#���
�������jw�F����F[&'��[Z�������t|��4w!@K������
����;ed>=�4��%���B
|����^�J>�-'���r�0�i����|���P P��kE��]��
�/z����G�~`����#6��":�f��+z��)-3�����
7S���o���xO�+�c���+��F>�5�`�m�����e����p�pR�����5#��;#'�t�YT�4�ul����m��!��P<�$����sm��}��z<I���f��9?��s � %}x�����s*[�������o���c&j0F3ld�q�F�K�����$�$��u��"��h)Ha�lA���3�KC��v���5R�
��Py >�D�_�#��YWyz���\��|��r��Bi<cH�+��W�E����.f��Ek������y��\����`�E{����[���WW]oh��}u��gF*�������]o�r�����o_�����R��|�����
�_���&��R�F�C���|�������]����b+/4OP�5�K�x�e�7Y��AC*�7�
�����.��NYtiL�������bK���@�%
{����Y��e���Eq�j]��M2���a-�<ic����i��o����H�R���Wn������K9���C������f�{W����7h8�s�{������:���G�M�*�,���4U�L'�%Q���.��41�����"K�~�OB��f�2'q,���<q!��7�yrz0�:�o�m��)��������O-+
�I^�x����d���9^���Ub���J 0���TH� ���*c�	"Pi�U����sZ�O���"�<����J6�T���]c�"����';����j�������\�v0�-,N��	���<+��tPr^�}���XM,��K������� ��j������M���<7��Y�!a~�`��2�l��u�R�q�|t���R�
��)f��=#T�>���g{tr�1%����,�~R����O&�ybm=�r��T�����%3��u7	6Z��Xd�
3��H-!Q���?��u�S%89�>�SN����O:Q�4��_�}D��t��+~N�"�����~��O��>VDz������]>�������n_/�l���qk������)�����
�M"i��8��

���C..��^-e`U\h��mL"r�L��a ���c��1�df���1�{���Kn�Q���Lu0�z��}}6 ��y�����'������S�H��V�v������`��#���)�:pn����o�#��3K����y����	��_����������?%��d�2��l2�e
�����m���v�����V�Z��	�6$��[&b���j�o�����P�6e=�
�4���@��m��J�������B��PS�r������h�!��bt?��TH�Xq�����
R�(1]�^���81�{-�Q���������|��?ws��,����U>�DB��9g9h�3Of���m��H������s�^�r�����l[z{8��%��ERL��C��2{�F����H��cY�������Y���>3&��U�t�Ns�m>�~;�E��h��i���T�f�������!��#S"�u3L�G��^�3�� `�*W���q���LY~����	i����ETZp�z?C��~���Uo��-������|����?Z��s��9O�jK����-�A{$��$G%�-��W����M���T��E��M���~�o�}�#��,�����~��<�/��5���m��zq���%Lrt����E_�d�����>�&B�	#���F�s'��}�W��������!�d���E��e��� ���q�a[&����k�s�0���;�������K:Km�i�Q���QoQ|<p��)�E���4������'*{�e"dV^wI��e��q����?��r�a
u0��9���X��gS�cS�u��=��
���O�E��WC�Q�������?��+V�����T+��#���B!-a^�N�;�PoA�1pj�\�]�����I9����e����(���C���e��(����B���h�I���}Ot��@���g^���+9�{������vA�_Sk�F�y���
��k������\O�pY��h~M=���h��_:�3�8�2��*i��3aV2�U�K��
�$x5�OV���m������zS�H*����_��!�s7b�j���E�����������$�����zf�Ume>�\�@|[p}�Z��5X����u���D�8��*8����F�i�r��d�P)p6]:?kF�t^osC�{�;��69n�D�a'sQA/<p�,�5k�f���zx��G?���[�8t�
6��SL+���o������.�v�90S�6,�w����������fV,_xw�{�;
�LpOz���}���I��'��_������;s`�g~��\q��,�����M���rO�>]#}�q�n=�W��h�p�c<C�����g����
��{������ag����X������+��\�v��o��w�����n�t�����jI�n����~s��-���g��`J$���Q���F�6��J�����������0�7�d���g3��~�I�?�;�f�!�v�Z�|���J�hy���r�1�d���2�d�=���oU�
z����
x�����%��L�7x�:9���r�l� �Y�������N��U���cs�[��_~V�e]?�S��q���
&'��IU���Z�_n2��A�e��8k�fG�R�d����~h����8��A:ln�% �A*@Q=�*���\9�_"vm1e��l�R�
�*$��@0����?�4O�{��h�����)h��&�� {p�m�g�C�IL�lG������a ��Y����,A�p��{a�c������T�������uE(�Pga��u���5�������B3����������4�'���_*��u3�	p� �!�y�y�R����4u���[��'��m����z��w����uh������y,(H`�s}�����7��*"�>b��^��&Q�R�F$������
�ltGs`ZX�:���q���
�8�V��5#�4�H����mq�+����|��q�]L��H���lQ*@�4�D���.�T
���8����~�[���5�\�Q���������t���^p���^
�=:�2��F��3�fv�����{S�e�X�\ |�O|X�jt�b���v����	�z��&3���#C�l%��R�zG����Y`8�{��":�3G��6Ji���{���gH)6��)V�1���e-5���KsS�S0O���g�����t�P�6F��H[�B��N���b|j�rE:vk
��?�������K����*��Q��]?��wm��4���G4�kq&v�%�1����8�7��+V���=��y�B9K����/��l���H���^*��)�?�����G���gh���60{��{Y��W+�Fy�����B[r�������(;�
9!�wI�]�2>'�^��������j)� Hy�9�Iui�@ !��b2W��|�t���E�J�������4K�-�GJ��$���D2�������g
ym�~o�K����'���i��V�(��gu���/�5+�gz��S#`�+�`���{����*B�Y�!^�>�fS
l}eg�ZL��m��?�O��a�����vs~���yg�7�~6:@���J����g���v��T��dA�G�j�/c��������@�\� }�gq;���d����_�SN���7}u��G����3�l��=�y��\�U�bw;�t���`|��2M�����y��%xS��:���>��PTe�;���I��=9�f a�r���5�r��?4]6n�MsSs��N��N?>�2�r��Je#�����)2�B�#-��QJ#o�z�{0���P���VnDO	kV��G�1��)	6dN[��F���/hd����Rd@Nf��7q��c��x��k
�E�ys��y���JZ�\�l������?:�����~���`�l tv�u6�KK��!���=��=Ek\�N�1#7�:� U��������p�j����@�?�"m�^+�1G	e�yT��
�6����s�~0�
A���/����
r�(�dR���J���%YY�9[�b��J*�����"��1����7w:b�4�7����fz�
%,��>_�f�]?���'1�YI�������}���z[�����
&�2���#���Y�R_��gY�}�>���r�3�e������$���7Y6��t���ja����1��OC=5G�
sO����w>n-��?,����Ia�No�����Y�^1��1�������&\E�;��C����T9V�L���mo	"��rs(q�����k���h;��m�����_�S�&*-����}��;Nl�&L�a���+�W�x#���9���$q+�a�j4%���+N���dw[�w���5`�J��P��+�����0d��v�<H��US�������3\���28��*{�C"�#e�W*_NQ�H*H�6�N�L��z��������N��p��a���S
;��;��C��4'������?p�>a�\���_�~.�#��)�E��;B	.����������b������;�O�df���-�ZIEr>>�./�d���lZ�294[�Y�$a6DM��E���b��qm�x^������G1
-J��4���Z#	�A?��z��>mv��
]=�."��	G���������x^��~_+���L����?����2��VX��WR�4��Q�f�e��f[��-�5��3�����8�h��r�8�r����q��gE����a�
����W�r~{���Fr��� ���~J���ja��T����� ����"�dM����n,���6)H��MX?���\�BU�V�����GK���F�2G*��cW��(`;��l����w����������s�
��[���F��)	p����P�;fH����ob0
��� �$��8i��_����h��_�� ��2������&X3|�q�W�^v��:��e����t�6�F1(~7Rr�0g��&� E�@M��1��0F3&��Os����CU�����';�����0�h�3a9C�hJw��n/������4���	P$Fr
4HEb���47�P.^[���`�5(�yK�G���j7?�<L��foc�G�
Q7�D�������c��T�:zNJ,�����Z�����v8S�^��%�~]���hQ�=����rP�)��L�8��Y�_6m����9�dvTq6O-*DDu�R�z�,����`��fu���NT\n�����4r����hfk�u���k�U��C�-������et�D��p-$�>�F�rbt��qS)�����'q*��Fo��n�G��Mkt�Z1�i%��52���d��rW��C�q�g��5��2C�[|���;Os�2����8�,|�0�������y(wf��R��XEC���2�l����[��w���;6�7��Zq���x���b]���w�!������h������D���������{*���9v�2������&�s�R(��7���y'����C!�{<�����j������������$���3�����D�X������c:��Z'7�rw���
G�x��L-�>�qO�o�PxV#uB��
y����3�����n������HR������4E�)�H��������j��LIT�D>:qL�"y�k��:��1G������V�*���J�E�w��)�x8%��a�>���-�%�9�����I�����������G��G1H�������FU!���]G�������N�<�����?ub;��;�
_��L�9N��s��L�V���@�V"<v.��p��?6�N��/�+�0b;3�5���~�ED/&��A��s�^z(C����nz������ys������@@��5_�� 
c
����iS��]�
D�+mR�Un�D�+R�a;������"����������8w�;�+9Gt����s!���j����=����R����?�I��:��^���:k��~A�Y������Z�����V�S6�����L�}�E]�|@
�-�Q����~����p��gA������hM3u��������V��j��-�+��Q6�w�r%�L���<��^kl�;�Yk�nFK��{���r��i�v����Q�Z	H8M�_�BYa����'�QWF��}�&��*�wt(rd#�DI��sQ��'fYB�j|c���5����+��t?������?R��?I*k������������
�s�����7������Z��k�����y��q�p+��M��5�}���Z���l���1�,��N���E1�x�-���u��H]�p�������������:3E���N���+jJ��"Z��F��/�jsf�F��a�BNx��V�X8�;i ?�k�����G�s�~Hk�����^w���(0��<?�x<w	���)��Q|�N���}�����Z��FUZ���4���9�Q&��.�ou���s���5C(�9�����JzBAT�������M���u'��a@4��m����0�R�>���2�L��Rz��td	"6 ��X)N�HK�iNeK��)�@�a��>��������'Ug�GwQ�S�8��D�}���q�
��~�I�@X>e���x����>�)���w��%�l��/����*F����R>?�[�����@1	��ugf�N8�/�`	&�p���j��I:n�I:�h���!�����x�U#>7�Bq�������Q�@9�����ub`�������97w�
��������1@���-PWX_\;pz�R?�j1�����!]�T�4����r��������+uvDQ�
9����U	���	��
����7���Q�<4��Ll��
����r�a���<����
����ox��������I Z<�����L���g�W�`>�������A%��%=���{!6�*0u���Y"��E���i�����'�fi#=��%��f�t/=��U���\�P��y�� ��Ve2�uV�y�$7�1����B�.��K�Wl�J����u4�2d�0����|X���>SR��J]*$�w�+�1�\��e��|&��?����5���c���
�t=��T�|Jg��XB�����%4{Y&����94�E6���^���/�$|_���xP@ae�;�|cU�A+�V�\����!6/������gO�aM=�d[�J��I����5��2�diFEfe/�`~%��2En��X����S�eG�R��eTW��F�BO0���B��2zBx�������';�[�5����T�f|l��%V����g@��*������f���oQg7�7r�����G�z���Y��kn��6�g���`�}�T1-!!���C�q�![���U���f����J.�4�����(�tG�c�"���/+=�n/������,�f'��`�)uZb�O��?:���,����Vo��Z�w=���^��$g��/��V=�*����:�i[61�]�Ajf������e��"���!�+�Me���������\U����`��W�-`8}�Q�"��}2X,����nTc��07\���*W���O���B�'��"��h-�W���g�|����3���V��6�w"�����%�4�;u�H�{$��f��������s��`�,
�Q������-M��+���:/������o�`�_S��}�o�	9�����
-5���St���<[cQ^Oz����K"�����.Z��$?��9
��A���~�fECU����j�y���E��,��h	9?�KW�'��oS�k*���U*UA�������`��t���~�m&�$��)>�!w�\B�Z�z;���t_���w�S�mC�8~<��T�J�R>g��.�r�����
}��Q���a�=��j��&,J<�,>r�4;��*1�?X��$JL�9�����"R�����Uo�����v�8u��p���(����%��N��S�[�I[r�O�_I����4�aV�m����z����h~�4)b�y�����sIZg�6Q�k� �^��>��y�������Z���������bY���u���Y�ov��M�0����|j����_��N���k����K��n����Y�rnY�kC��������94�������[���`�0�'A}?�"R������1rY2�����G�5�?�XvwL�4��q���3j^MI�e}��s�|R�
&H��4S��(+���������;�b�v��a�YX+>����@-h�)I,��E� ����y�3N������ '��C@�X&]���%��0�0�4;��H����O;�P���� 'Ae��^��^���.*%�����i�)��:��d�%��6�U�T�x�>8Oe�����[�+�x��(2���u@U��7�������H�  -" R�]J���	E8��twwwHw�t	8��(����=������k��{�Y�f���Y�f���
6��vb32
?����xG0�{Sr�R�����	y��$���U�����������1���{
��[�U�������P,�?�}��6������Hx��K�E~%�'nC����U�|eI���A��cM�A���L��N�I1�T�f^�f}�C���%��m�67�k1f��.�t��d�VNN�B����R�&<��,:��3��o�� G�g_�2�����i����r�j(�Q}m���q
g�>Jx]��Nw����[�=��!���f�R3� s�
����o�L��Ul�?V=�r�v�����a7���46V������\���Q�3Wh#]p��|�]m3���.���4��,�������>J����?jx���b�MN��0���~���R%��b����C�Y�]���N0p]���e�L�(@�f����(_,�F8�"���y9�A^k���zc�2���<�����?l�aL~��38����{`Wz�]�#�XJ��deke`�'���s������xQ����tHW2ao3c����4yA�.F�]�7��1�������G�j�����&i�rw�>=�p`���2"mh=�����$����]3~@���9�����p�{�`')��D,d7����[�KGY���&DS'�N�2I�l�G�_�\�3M��j���$��������
q
���%V�y���V��X�?qhw��z��V���JRL�Y�E*�U�ej'i3�Z�W�T�-�%�y���Z���R��M��]�&��we��=u[
�)dja������p�0���"XV+�!�9����2#�)�)7~�������M%>��o��m��~E����S}����5Q$^��D��M^��P������k������:<�o,��r�Q�Lc�}`k.��,����.��������s�y5J`��_��|�'�"�\�id8sp~�D]���$�{�������������<��\#8u��<{u�\�hc��T,b@V#�����F~P7�H1�����1���h�����i��b�~�Z�����Q/c��vQa��2'�e�s�Q�?����^�������hD�~�7+m
M����B _�l���i��>����p7������w���^}u5zy`�E�LQ#q� ��W�Z��k:���4��Y�-t`J�����T����h[�Ba���Q�����p`����\8�:��B���p?H�V���]���wh�S����&��]���U��F�1�����UZy�����v������;�;�.�O��O��JkWK�����9"�;Hd[��9w�k:�����3L��K"cq3���0|:�(��G��_d���/2�D�+�!U{��x��Cc�"���gAb����"-I���y{������@���F�a��}�N!i�[���7�P�T��[�1O,�M?��+ ���j���C��{�!*Mr�U��f�� ������9p�.51F\:�zUq�O��*�/(	k��*?>*�b9���Y�j�o�����3
Q��
E���$�hf��l{(��E�l���X�����uy@o='�8%I�������c#�m����m�i�/�nXa����'��	�B����4�s�hAb!��U�KND�vY�!���yI��y_��?�S��Q��1V�J3��K�����F�GviXkS�f�h�>�O�91�(
���9E�*[��R.�V3�|�fH��s���\f��	[6���C�+��N�e'�����.C������q^#w���}�%E����t��NbN��<2h�n�E�/��Y�*)���+����dY��~y(��iS*Hw���rT�*q�H����:=�P�b��A�r7���##dUu~ ���h��J_zm�]�A���j��o��s�-3�tj/�y���T/,����@=��[=��������'J����@���y��aeC_��}3�6�%�t������+����q��I��K�����l�:�
/Y��51o*}�%����5���f�+� u��($���L�,�R!^s�Mk�M�`��rJ��G���S���hv�JH�n���oi���b
[������'&q,��?���^��tN
�O�c_����Z-�3���������8)K��+�k�c��3��A�9n@�����#�H��I(��3��l�V-�j���m�+�m
-��3'��K<sbr�4�G��ES���9�"������Q~��o*2���i�W�sI��v6�o�"1�
z���}�s&�4�Fm�2 �R�|���n1��9}����S0�Q�0$��w>�h��9����8������>���^�p�:	 +E:��}=�����������U�N7u�q�Q������qBy�l���n��<5�:��~���n�����dT����suhn�l�T��N���i4L��c*�S�&�p���t<]!�I�;����@<\��b@a	�;SYUd2����$<SQ�>��:��D��D�!�iX��!e���T�~�x����S��O��4�B(��_:����r�����vT��}�d���y��o���@��e�^C�!���VX�*Y�(:E���m1�rD�l�o������0���s��V�pEN�>v�[���������C/��0�z�[�h�V!g8��eJa�y���@��*_�h����#~�����K���q�l�f�;?���>��b�
I��Y�U:�E{�{V�,�1�4�7�M����`�,�d��B����u�������������"������pXA�������j�	;yU���@����)�A� V��g��H1v�;d�L�,�Z�������vQ���3�*o�q����(U|y�����/e>����n�������C����y�U(?��Z�!-�����9IS����^`B�yM��OK.�*���(�*ZKR%�c�;��f���(x�$�Li�~����a��5E���1G��R�QG�@�"f(���:CK���`
s�S�������	���u@������6�zzN�A����������
ZSUN�SH���Y��h����FJ��YJ������9�V����@�q�y�V�	������8�y
�Mi]��1pK��|g���L�G`'t��T
�J��n�>�iY��Y�>�;q�t��4L�
�{��X�X���������J�5a�oq2 M��t
������q�v���q��dZ8/~|z�V���t�����t>��'p�?��"=}br�r�ru�r?^��LW16W�����������h��G5���r�j�5N���5�;����������Z�t�G{����B�W�5�R����� ?�d��*i��9B����� k.g������-�����cg�����#�:�M�o7dl	�-�(-�p;qlU���dlCw��{��"��O����N���Nv�������Y|�����>�y7��G�=uq�5�Y������~G��pQ�9�b���$P�|����S���d�X��I����N�V#������	�@����dJI\5Rs��T�1!p~2[q\��������:;������8���9[���$w�?�� +�=�Nb��W5���w����������@�+��T�\�jx���NA�	lXqs�O�<f8����8�v��@�#�
�^k�����,,%6����4������X(i����cE{��w�s�v��=�V�G����BG�[�A5�����w�!��;+��}�t��d5��=����ZU��1��I����j�0�M��'��JO��}�b�>�$KK��B��~N�^�����������,g���Y`�wS���P������2�X]�)bD���$�DV�95�@�����d�;B*KAZ���5{�.'���_[��|�"
�z|=k��M�����������(��)�U=��	St;�;��$��^����>?1r��s���)�v�:?��j���[�f��*{I��]
	���v�!m��4��3�F���q����:���qE:U7�B��^t;�Y�pju��n�m���=����3��eU3��d �����/a�a�_����3&+�Z�u$HZW?��u����w5��Q�J_O-1Q�]zf_�^g�C������z�<s��OV)��Y�q����{��d<>�����8\��?	G!i���S@6[Ub�M�qA��fe%V�vT�V =��5��J)���P�s%5��P��nX�:����b�u���gN�v�3T}_�=Fd��b,7cfLh�=������mU�p����Y�	��*�.�6�
�T{y�'GUT��>�����[���N�U���Ky��:=F�|4!epk��`gN��=��B�S�Kr��BY'��!X��Gm�'�D�C�����)�]M9o,x[LZn��V9�+���`�M^����p��3�����^��U���|%�#�"���]M]��N���8�z����X��yt2��(����*'����F�Mm5�������G��CO����I��S�y��Gfu�/�U�����}�u2��r}@hnQ[�U���o	��\q;�%�~A�=x��(t����m�b	e� �64#��~���1������h{����0B�d7��y���3���%9����[��C�������N�'����*B�6�"e���d���O�O��y8F{T�` [����]03�$��? �,�@��%��I�e��d�D{L�?��x'�~I'{ 4~��)�� ����^
�����?������^��;�^s|�3�c��W���>��i[uBu���dqQ�Xw�d����FM�>J�LW�~��G����k'|�k.Q11��|��j�����|�5'�e���+�����k.��]7S]��[�:���&0�8�8
��wM,��8Z���B�U���1�o���������o����d�+���`���d�|��D�
rRu���R�c^C���j��R�|9�N{|�!�{^]��}�]55.������y���O��D��dKs��v�N��N'�N�O�h��'xG�k�[��t��]��_p-w���:�a�Q�A�m�u��X�#t:�����v��Z
 �tonq��y�uCO�hJd~���e{z��so��~�K����D�����hB�P��u����������&������@�mn.���Xg�kuU�A��kuIi�s�����}��������uWW�;��I�k?��8��������i�hG�����"T����^��"a���^�l�"T���
k.�������������4��4�C�U���Z�������Bn�����S6���76~��\����{�����Tsr���X|3�p���d�~{3f�6wB���d����x��0��v��i�9�jQ�jE��k��~}�l�^�������|
-KM��z���Kz��������|�����	����6JT5.����l��4U��.���N���L���J�
��ty3��r}�r�7qt�Z��-��y7b�&��&"0��T�����u���d��do�eX6���Y���ews���`��������j
O��������r��f�De�w���m>��S����G��\����&&���#l��a���&�C][�1I�W����N�:�mm�;����;cr��4���J���'N�+bvU�J��*�rY���@�3���I}b���fq�c�qqi��k�]Uu��"��6�N�H��F�������tO����������L��z�JU�!���������zN�J�4�S�5������KF����,b�;�;"��0`�)���	�~���N�~
����J���K��r~�5�Tk���$���W]�s�Md���,o��&:rW>�Rt�p�����V�� u�����+R���d�ps2�A3��d�����|�����kN]<��'�T3��q���
o+b��vI�4�"j^�FG,eJD6+����V�T��B6r���n�k��Z�3TH��-��,v�>0b_2���V��r�b��vz����mX�����6�}�_�	��C�����]��w������W�s�4�T5�k�����j\�~���%���XW�x������m�e ��n+�T��Zn./������\}��C3�����4��tai�j���9WA��i����������=�@�{������y5�����'K�t
�k�S~�I�G������5|V�';�}^�9�k`�+�",:�K�%Y?��3�d�
8>6t�=��7�9q:��g�i��qzP�z�K|g�����,��4�����Mt�q�rh?���uvz��Y�{��s���gny�o�Tf�N�Oy�_���4UK��n���������D��O�G�����VM���X5A��8��9�v��u�t�ta;��ro����o�e�������8yb}l�X_N�&Ol��������&l���d.e
�����:���c����s�p�s�YIs��;�J�:^������u��5~��*\3�I������v�G�H����$g5�x�4�����_>>����������v��+3�/��7���w��)�|z��O��e�jt���3\6rJ�����s������2H}�����F���a�{e�Sv�z�M���������O�+W����i&,��7k��o~t��)�O:�h�z�X�3�A������������|r��A���`yH���{������qy��F [V�Rx����������Y���,����<(S��A�\���W�_������M=!�~BT���t��qL��~P�&w�&�����������ta��i:��fun��abPq���G3����6��� '_`�v���|�0I!�%aE�m�������.����g6�*�=�.N{�~��$�v|G���JN��2�9��D(�{k����Gl���7�W��m��O����������	"��KE���!��U���!$����F?y�~�az��0?�-&�9��@�1!k��<V���2���N]r�WY�Iu)a���M1��v�����LaP�U��)'���i�a4�����F|��q�H��^W~')�G��s����n0t�p��6�m�0U��:����-��L����
�:�k��K�n������p�h?�N7h��^���YdP��K����~"'��v}!��X�������G3$��b���h����oIF3XW{���A!�J��}^�s[�nV�d+�_a�n��u=��b)+��9�{�t��#���<j��yL�	G�>2��M��qu����N�A��~���ui�g�z�V�
��C��*g �C��I+2u���C�o��Rv{Rt�Q�D
L� �7�3�8�K!�0�cl��e5���B����j�����D�����l"$L$�L$�t���-"3v4Z�d<�<p��3��t�`l���l���.����RT��.������7����Lv)�8S��D2�3�K���k���]Z�A���V9(piCb38i.�*��)�_a-��</�^b�b�T�.1���X�B��5
h�m�R�(i����s���V�jr$�xHiy��0z����F{��q��s��e=>���X��qQ��5�
�y]����=�\�EX����/�~����$����2�;��l�q���*������P���Y=��z.���L����-Ju�����,��t.�m�O���-*x���WQ�wy3oe���"L8{Q�j�1�Q���gY]
%�zZT�*���~��@�s���_	���kx�U���e�D�<�9�s��Z,&��L�[i����(����r����Qn�M�Q��w��l��\I#U��r���'m[�I�3I��Wz�� =pr`�i(�&��y�6���el��/�I�uH%j:��T^Z�+����7��\��j����<E0�y�^�Vt��h������|N7��-;����X������3�R�L�Ji�w�8�����ip��'*��O1_��@�V�m]5�l]=3�x{���!d�nIkeS��i���'���YQ�S�	��S�������;T\���~8�;8�E6p��";V9�LZv���o�di�L��)+s9Q�P��4�-��sI��{��.������l�bL�L�&���a�N�eD�����!O�2��I�M������"G+[�y��K0���s&*�R�F=J�A�j������]Y{�R��S@g}/�1-���u���IU��Ev�X��ZA��A�7q�&��2������#���xE�or������}SG�3��U�t���m�D�q����5R
�m9\
�;\rF"HU`w0B%������������Vi/� ��dO�M=�b����4�0�������y���Bh�u�'�vl�D�^L����da�/��)�|	��N�U�����w��
����
��~\����\��:����\�; �]�8"��v��;��H
gI<:<��������U�����Q��i����X*�W&!``?\K�{[����mc��.��!��g��z���l���!�Bn=I^���c������J'����o�#��x��2��u���2+b�u�����P��C�����E�|��q,1F1�3�7����3��7GxI:�|��/5]�2Cx�QB�K�n/��0&�������C�5����9�1��X���0Z�20���C��ME:b�+g���U�!�������f�,���T�]e���cb����V�u#�_�hBw��/�`zy�p���G1�*�@��,� ��*�$���-����q~U���w����z�mO�P�_���|3��_���I��bR�m
(�W2��_jx,q�5+���-�~6c�c��:_�!Q���d�
]@�c���p�[T������t�v���
��cg���*��.�d����|�������?���
 �10gz���5�}��8�`��e�E^[4��+������	�q� ��� �V�m[[��S8�WD��@U��v���]��H�)y�@}���
�Em�JCu����G��0�,]*!;�LK\�o�K� ����~�p&���N��Y�����w�G�y�F�R��������4���5��L����`�A����,�}o��TW��L���L��<\Z�M��Jo5�0d�H�������oI5%We ����WS+�~�~@�f�+���	�$a�A�d6	�\�������V]��"V�J���/�V�G����Cq���j�B������ <n����[8R0�*��jq|�	D�����]��",}�����;�`.��L��n��L%'�&�wK+����7zc����ytE�F;Pi��{.�(R�O�et��N�9�~]{l
��;�((o��8�s�
���\����!��`����8ZV�2�?���u�
�j�����K]uK J��@�s���X�`t���n��F_�5g�C��������7�2V���M2>/}�[;)z�I�gyK*��}��f���@E��1�����S��������5
Z>��<zJm9��C)=@F��D�dqH�%A���mF&��Fd��c}3�IdiJs�:�����	j���xgC���dq�����<�il+q�$G�XzV���V Kh����t�6�������g�/|��#�C_�	�nY�;C"�x�� �C_�;��vw<3�����6������r�	�=��lh���6��������b���Y��*�)�Wg����t�[���jw$P�61O���a;��>`��9fi[l���9�P�c�?'|p��s�@fE�+��#�(m�^[��Ri8arf��_DM#a���������p��7f+��`S��A{�{��zOBe�5E~������D^�2E�	6f����iMN�����Z���]��]=<��(/mO�Tr�M$s��J9��Z��>��;[������s�����9��&�}������B�[���/b���f����>^U:�x T��
�=�%��/.���2��uH�^�����$E
/��n�����T�����A����TO���Z\'���t��y�}9q.�� ����V�3~�XkS��5���c���������_�N�������������+���3���pMB�sx�Bjh,����qr��'����8�
dw�G�d�\B�����n1��lEZt������V������kYv �uzz|�CR�8�.�_i��������j��$pz=����~���e���RU��;�r�����Y�����l���R�h2�=��3��R������,�l�����l�1�l~F������|�3���0ek�*{F��\�hu,�Zy����]5�������D�u`0�
Q�}�����a�J�;�);:7�?@���{���WP
��BF�H�����l�{X��ON9^�o8r�����)��������P����@y,��N��Nj�W�pX}?+��7��>~B �nI}/e
����Cy��o�L��������
�_KYH&������
�Q��%��������w�%���3��o�������=��m9�u��"0���<�1�B����;*J�^!�#Qf��9����]��tzS�L#<4��)���M�7��b8�
ho�M�CU�ZQn�Ms*tu3u�M�\���]mI���3����j�C��G�f�������_�4�q���*���ox�y�����I,����o>��������>�������
>[�����Wl��?��Aw��-�F�CI�� ��{������8RBXgGH��8.GrC`�v)kt�f��	�x�=Z���R*j1���B��@,�-S�	��p+�����P����&7�r�b�{��S���������>��+���|�S*bN�E��v��[������F����i�e�xvV����,i�����@����q��!���/��{,r���	�fr	�,��h���2���X14G[���QB������=v�0'��Hf2Ea2�s���6!t��f}��^l���w�j�=��[t�O����o�����=�+>c���d�7+l���x���J!
�04��/�H��.��~��V.�������q]��E�p�}|'���<�I�_�p��g%����`.����l���h����.
�q~���jB1���5�����s������b��W�h��/�z����TC�?e;2�1����%��h���iS-b�#��+�]�Z`����ZE�����K��������^�JI�s������^h�K�������N�{�8A�<�p�+p��r�h?��^y��@E�@�R�.��/�������)S��p����6�_PH�
��
�(���f��F~��_0Y�����?�����imw,jaDg��UY��r�x LE������u��=3p��Y�b��p/���4D�q{ �Kxt�u,.��Q]w��������A�
u����I����'�]7��9"[�`.��,�~[��F���'������LN��Lu�x�Y"Rl{[p�����GS����G�[D)�gtb
����'#�[�!_�UB�hl6�����d=	n��*[�:?n�<�z�)��f|����Ie��f����l�p7�;|!��u������Rl�2u[�^:���?������2��&l�xT%N�.�O�c��=ia���� ~��(���5���������p����W(?������x�;�"	j"��(7rPp��M�:��o�T�OR��������y<�^�x4c����E�,�Z��w,ZGLv�:iy�A��q~u�����NO���D�����ujF���$���B|RA�����}D���`g�����a/f��+B�����d<�����i��,�r��7r��������q[�$�)y�8G�=G�>�avw����m��M�&��X��?��F
�l��c�����@���5��g����C��A���J����0�R	�-��L��*�����q��b%gU!�w�C���������	��z��'����[f�[�?������%�O����nj���F�������b$����s�}W;�7�}9,� l� f��q.)q
��W���Z4�d]���w���g^��g��g��E�����or8�C�y)>��K7������AP��}���i�������->�qmG�RAk�Hk�B��\8�.�|�?�Gi��/3���oT*�
(��`���������qkE����~��Md}	{`�����?[���5���9sjY�
.��s����Q����\_���i�g`|34���t�vE+�R���/{p������0F�E��K*'�E���z�-��a��W��A��3��
�M�$���QE=��v�>�f��A��z�����X|,�����e2aS4�q��3�K���c��M�$ z�$&c��^O�*���rC�fpTX���B}�H��>3: <�����Z������B>�j+�D�-�Y)�aU��J{�^_�{�3Q-�����'u(��}�D�;{�
+%~PO�N"�:��&Zj�'31�D���b�����C�:NXM2�*;5�z	~�Q�r���j�m��4d��C��=�,J8����!���#�myH�����^��v\�M�R��d�j�;�+��|9��a�>0�2m����)r#x�^�C������l,yh$����� �]3:	w &
��[(\�&q�~>?p���]�=�Z���z_b';bnx�]L5tma'1t��/����N��4C�t4�>�/��~PW��o�;l�4r|�h�<���y�^[�SoZ��-x��kA�7+�'c�Y^0������E����)!(�+�\�o��M�������y~����`�������EL��T��'C���9�~t�?���"�x������l�	)k���� ������C�pZ�E���73D�>��g������xx�e�{/���?����o�Tx��]|�����%sbP�
��4�����k'�q���o���:a����D��D��-��y��^g�0��?NzX>Nv�^YN�����^�>��{O����!R��������4�5�����
K������h�6>�	K��/��,E����:*����!��2���*"P���Uz_%���Ok��Y]4�����f�,�}�q�w�#6|I$p�����������=@�����-�Eh?#������bou��>��-����E��x�1�n��v*����[���Z�e�>�!�����a����$��8��"�uA��W9<����Z�{�O1��Se���1t���9��3�S�r�
,�
�2ba�_+��Y#����+p�7��m�bl��9�@�����������3�6�4��U&t2�D����/�UUYD���G����lD�#����p��2�Sa3���m����`����G��P�81��/L��K]u�>�pr"5R>� [B�3�%��A�)c@$�U�:��6�P�`�x���qZc��5��N��s�O)�#�o�����3�o��-���	��	����o�t��5*�FL��^��b�����/e��M��5h+5i+��i�$���=��&*���Q���~��C��{\�$�a�a^~;���%�D-����o~����S����\lS���(�l���?C���\Z�`��!��!Q����� v&�6��K>���}')�g`�XWc`���e5�W�}�O����y�����g�6����;�����''�m�����(����W��~:���<��\�����=)��0s��K���]��w�������J�J@������\&���\�-8��w�O
��G���+Wj������N�oP�e'��"��������y~a�t}����7%�.b����k�'a��OHb>G2�(�+����,�����L"�X�Y����r�.�wD|�G�Do'���\'�[��EvH��b��qui��Z�9��/��~w�`34��������?���eH����^��^6���Aa��u�<�H�����&HY�&�������@��p4*_��)P)�f��lV��i������A�d��P0(����lyt�=f'3���EP��Uyc�6���V�!��}X'�R)�:���bwuB��ny���ZP9����!���
w&�������k�4��Y�4a��{�:A�U������t��!�?n�1�:�<JD���&�@�.`8�)��*L['�Fwa'���2��q������}������K5]�a���a�W��IR���Ro�`�Y|��D���m���Ql;�r����e��jO�ig���?p��oY��}�����2��m�Bw�j[���R�K��J��|���[},�)��g�
dv����P[�7,�_�]�����e��4��QZ��t��Y�����D����'$�$
9���QC��d[�	��|k�G��*����H���N��-x�4����iN7#��x3��f���dX�����v#B�<��vw6"���u��^�n��o��w�2���/B{az��2�"S|��p �U�����*o|,�/W0�T���8�l!�T&���������E��<e�:t�~�-��6���B���g�+�m9!���6�	d�B�"���2��Tx�/�-�?��FmV�����n+
Er������~R�
����M&�M�J!$��`�md!���dO��D1>��7�Fa�l�!��1�g{�~��������]1�(�����6��,�n��"J;����Y9�V���������_����m�������o�S�f��s�����mD2[_�P�<��+�]b*`��2���X�~�L|�7�`�W��^w	J!�S���S1��!��v�rOS�Yd����6/�[���}���<���:3�����x�&��L�������������f�����y��f�~�y��m�'k�E�\>���~�>>�y?��������0��_��H�)y�������-�an�_��G�z���+���Q>��a�0K}���A� ��a�X2���f�|��-��N��U}A���"�E*����0�Bw'E��M��P���������o6q��\$5�5D��-���BQ����R.:�����c��sQ���9�N3��&XZI������9	����1����<�Yj�9���9dt!��������XV�mC��mD��6M��#�e�F���������~�;"��j.��m���K�E���/;�a=���Cu*}�)��ge�=S
��xV��\�
�s��J���Q\@�:����o�[r���g����+��L.s&0�,��������S��*�z�p�.�kF��n�J)�J��B�*���p�l'rR>(fd��=y�U��0�8kr�����n`n��IL���|��?�Y2�������h�0���U�]F{G(.kGf|ml���
}��Nq���NfK����86�%���a�+���E�!q�d���������� �r�w��F�r�����_��_�ygV�r���pIIW�����P�	��ri)r����|�bD�R����
bgc9�������>��Ap�z"��:��V��\C�1!�-a���n.r�=1�^�m2tG@W�]OB}�rOP�s,:"���xi���K3������$;3�jqGS�5�"�J�J&��L��Pn`O�Y�-!v9�	�����h�M�����=S�^m�Z5|��c�A�b8R��C;�������Y��"t��+h�Wh�)����(�����"%�u���k�W�����0��:���y�y����%�w�_��+%pC��w+f�k���� sCz�$����������j@=�"{����yDX.�E������=* �dh�L�D����0�9����ey��0��Kp��zG��)�����d���[_���	�����X��` ���z����S@� z�%��������5$4P>2<o1�?
S6�t
]C~H:�rC��
�7��4a�)E�P��}Wlxk��2�{
������/-����/Y�T��J�*�.[����z�y���E����g��H��������9�����^���J���X���������>�;��#��!�l����?A��t�U ��o�q�Y�������h�l{7jk�"%��%+(.����^!~~�+_9 4T����[os�����\{ ��L����f�|C���������k~�����u����;��������ml��P��I*��~��O�,Q�1�����H�w��N����+�7d�)l/�c|]��Z���-f�z�9�b����yjU/�����������y��Di����ZZT��
N�w��O���O�X/�*���Z�o_�m/Yi����(���/� ��-��t:L>tEI�B��������g&�M��}l�kYN�P���s���b^������]�#U�z*p����|0&5�3y�$',������D�Q]�������w\Q�d6j�!��@���#4���-��m�A�w�%e�7��!"�>��,���`z"�v��+��S�x�)���@�$$C���]B���*�@�k��O�. ���*�@	D���	���Y+�v?f���,��?�����"1�0�{���?6�+��tu��]�)�������>\ELn�"=�0x���$��M���,��n�5�r��5Q��n�����g�z�S�J��6�������b�4�O�����a��R�"���l4�����M�(����E@�
��Lq�Q��-7��?)����D�����L�-8����x'R��"�3�}���
�np)j���V?�w���op�_Fe�����I�w3����<��Z��j��DvY�arCL��?e���D�w�����������B^���"���4oS\�B��C�����V\IE�
`�02�U}��CBTY��^L�����3�cnE��[}��Yid��MI�ip�J���Z����O:�����C��Oz�4����t��w��zu��#HT���H��H-���vC���?"��^�*��&�%��%8��e�z%\y�G��V?��WG��>��B
K?��y���L����}Th�r7*/J���#����J�p�!,T�����<��DQ3����w��������I5Z�%�ut*���\�����dC���t
�I�x�w�y��m$='��rn9��QM����^��*V�u0�����H���J	�H7S�=:�;�`T\�aK�	0A8�I>rz�P�g�����2�%��gy����a��d���C�� ,��E�s��0n�x���5;�.<�XB���h?H�C�R��n	���x�C���n��_�����_oJ^q_"���~���]��XSF���h���q�'�H?��B�����5�"�\�F�f�7�W
X���r���<9�*�f��
vW�<�?�x.�e�=s�������"`d�����}$O!TzC=�m_��)���,����u'��@�s�7��?�Y�l���f��;ai(�S��Oq���Og_L#3��d�z9�|��	:�?�/��C^G�_N��V���n�y~�a��M.�GSjQ��7�5�I��4#%��-f��_���9�V����R��n���gG����I�������cG�q�-���3n�c2,�
�kA,�S�I�ed�%Rb\�=���E�m'�Z4��2` &����n����g�(�NJ�~-�??��
��9n0��bJ��,Q���d��z4��d������}	�r�1:+����$�������y��Y�������h{E�b@|�6��rt*r2��F�c����\���=|R[�M�������J1E6�w\�t�"+[�	��{�����{��Bl�xj��Z���j������1��8����S�(�\��E���3���}�J��s���m����\�{�QZ��D�>��4�j)>Gh(Y]b�:����6�)�_@/�q/��&QO!�9N"��ynx�~II0�:����������p��&��6
LB�����lK���d~��y���nI�G}Zp�yj���t����3������IL��UZ��x��q��o��:��oY�B;ll
I���{�_;����
�?����
i��W*��m5��MZ��Y��������
����R#��.H:dI$�
.����F����3�P��;���m�;W�DAAR�]OV�yy��K��y2�G��R�a�zL2{�?i�<��*H;��nNA����b�V+��XAn��6:�����#l(�;g�	�m����������������b�i�m�i������}�30��q7��r�M9sX��h:M;���_���.���_��F�7W�������=.e�����b�5��`~Oq�6��p����������B��^���|�f�&,��xIx��r�8�be��;�3i��\�/ipa��u�0M�%[��G�.�<x?��7V���YE"r��\6t�qZ��q�X��3��8��3���N\l��RV�Dg�(%�T����f����d*�Y��76'�9�%���������L����
!����J�v���y>6N���d�����%^5�m8m}���{�/���}�)�y�f�\���y���12���}��p^YA�~�lo��h�e�;����5?��f?$I\�/_�.����B���1�������2mLS��oF6���^������Ih����k�r�'BYjg����(6)���As�w<��y@-�Cs�&)y�?��Q��Fy����a{HJ�%�����dj�'A1^�6�M�����%v���������[��~�3���YP�K$�iy��
��������T�
�����gc1K;�Wx�C�>���5�P�u�o6.8����.�������7��/{So>?+SY|��P�\���*M�f���,�(j8;������M���`�C�%_�V������� �H��Q|�� �6*��z�~7N5\6���S�)a��A���&|B*��;��/�u�V���5$���C��^�"��_hO�D����	M\�A8�&���{#�S��������(>O����U!d[l*.B����p�?���5�y2�Z��(F�4�>_�����x���Dj�Y]��s���sH��'�5�^r��]J�0��,:�m���9�KV��n%�A"���
+x{��/���P�p�l�p�������t�>����Y�ZiX���/���72����h���
�q�u���������K
��e6n<����c�3�aRs���~�7zE���a1�pP�,����x�~����C��o�|"��������X���2m�#���b�4����??�
�'��I�m|��F�u��!$>���M;���(�GDQ��b�e�_��9UT��a�Jg�4����'m�X\��x�]�V�|�J ���.�5#F����sh�YvQ�K �������qY}\}0��np�k�0)1���I�kT&,a��U�{[F���6���Df�`��&+���E4���.R:�BW�_��������>@��-2�1%{�`!P�`�'��G�g^0E+���i�)U�
���z���i��ctQ[t�zJ}�A�����s���>�����G$����w�����E��p�����$��CJ�=O^����)���A,��6�fr�g��J�+f��V.���y?�	����D.�T��u�3�+\��y-��In�s�4�Y������z��5fIpTQ�kg����Q�W�R�l#<��YH�D�P_�ac��1����& 49	��!�*�>�
���oc�������q5+����[�f�-�As�����a�6�kg�����������O}L���0}�\�f�Sf���"�|�_u�����c�%B�,�t4Tf:�^����Y2+Br(�����9	�\q�����*���YB���C��p������)��B���'�G��&(����\7T�~��/��<j.���z���I��N@�{�0��wf��u}�2���.+V+R���Z:��F���D���/-��R
0�����ETtc����#,��&�� s�C	;5z�]�.��$Y��W!�r���`��C���\������=/�_�U*0_�+�����%O��B�{8^pVa��������0g�>H7G�"%4��e�����tL�Kb��v!��t�u�����ZH�� Y��{9���(���5
AZ��,����:>��,;5�����`$.��+�����%h�e��R�R�P�2X���5����BN7^��� D�9Z�1�O;^�`��e������y^T�\������B��#�$�Gwq�k�����a0e��WH@o;�"��04u��;Y�[s"���|Z�`"�]T�V��Z�r��M4V�Gw��@����G�V��7�W�H��@��O�g
��O��	���D�m8w�����G��lh����$���WM=R�/�.=���;�oD|����:�rH`��.�����X�,Y�xH��{��C��� ��������^��K��I����+�� ���*_��
����:�������oa��M�_|	���x�m�L�-4h�	��[�_D��Ne%�d��@��c9L��
� e6���$MC�/�@r)yY(?�k�I5Z��v��r2���_�t0����~`�Z*���}���d��=
���e�|��9t��n��e:/�%B�������Bm9���W�qEL"��F�.�M�����r2�)WR�������3@]g���|��^����^0"��e��>K����t��P�s�i�%)�<5�+~a��3X6��fF+���4�sh����H��Q���l�j�Z_t�����I[�6�E����w�h2Y}��^��,T9��OBd���	n��je����� ��:�z���.%vi�D�t�
��������`�I�����-vn2�?�1he�'�����Df�U�>������b,v�'���IR�]'A���ft��P�l2���4t�r��(*�������t�7!�G�8]���� �9��\6w�b�����&�1R��������]<�rG���}d���._\��t�8�i���A1J��������'��'O��8�L�E�c3C=IVi ��u�B��0rD�,F�����	S^��Ed�m����4k
�?��'�K[��t�����;��#�&��[ ~�1�V+��%�m�����k�i;SxRjG7���;��������!��l d5�:����R�r+����Fd
+l����\1d��(��&�����7.X�e/�����mO� F��t@l<-��5/��:�]OBg@�;�x���e�kG��]����<���7!{�wtdU���q��1s�

'��0(J�+hS�h��#�_�b��������*
�H�f.I�4mJq��[���R�3q�
�[�#�e{��w%�}��������|m���;�;��?�V�PJ;M���,�<g���wh�������j������gq=8x$i
g�8�CLv�������s��b���Q�F����S��{�/IW	�BJl-�Av��1����8�w)��Tk�c�7_��)������zm��$���qS�"@�MaI���M�P�V�������S��C���~��b��c}���f���xZ��{��_=�1�\�Q��bWO���Y�TI2�9�v��y��:A��������{�?���
������"�]K��]�����q�W�i���<kly/8������o7L��� ���\�4�m�Z����N�x�}{��Y#��(�^i�<��w(!U�.1G�8���Le���(Y5�T���|�VU�J��3l`��R��_�@�]�2v5������~�#%�_��q~�f�h�Y����*���t<hq������)������[Q,��r:��@R�l�+3��fnt���`i��������>/�v�����YGE��aB��Bz�[��o��3fl �&�q���Y���j���$Z��x��8�)J�r���*����O��D��%R
l�2�O<W#p4��M�W��k'Ma�����|"&(Wk?2�$����&�?�f�,�&�=�����H(Z!���;T�w��
�� �Gn���.���4}b^*O��.�FW��De<H��3�s��ok�r���c���~%��V�j�x�;_��^�i+a����j���;S�"{�.�O��&����!����6Gt�rtN;~�L��VD3�����e���1~-��M���(�`�-�7:eN7��S�����"��r��*�.����s�����p��N�~�����y�����$��LA	��r�e��(������<<�t��P�Z)��	 ��������}��~��\tS	3�78�_���7�
��^4/mQl0�)��@���
��>�"��Y`�%�1JF�Jl�C����E|fE�����qH���^$�
`��r(!Y�M=������FJ%�������@���Z�.�H�+���|o�� >%K%�@���]��+"�������<�?�?T�����?R���vy.�E��r7�,�����rm��"��]�o��NJU�</"������:X��|g����[K��+����J��a%��B����!er��t�������K�+[��=��W��OB'+�*x\���M�[��Bt�K���L��� ��������)����{7�.e6�-�3htg��MW��Lp�2�0�_G��%��|Hy����l4�3�Q����Dq�H��T��?a�^���!s$T�r�Ru����?���R���*�~����<�aL���0�8����6b��uR!�u>���F�Zw��#�S�b�[���UA{b��N�Hp��\4�,:�bx^�|�����������|���v���7�G�o�,�(*�9iD��D7
�7�����)��$q�
_^��I�U�t�2#�G;x���-��1{C{�/����r?��r�1L�Y��%�^�O���+��1��(�<t�F�������7�K�,��rp`�U����<��P{�3�+�O��*�������I�w@V\�H[W�y�����>�5gz�f���p�5VqR?Ex���	���u��,�m8�����S���._j�`��w��zm�e����n'|���c�f�KB������TB~k�@����&j+��T��-���!��EZ0��`>z'/_,�k��`��&D�32]������B�o�7C���a{���w:+M��P�Q�2��]q��
��m��y����V���r}�R	@8'�Q���(@�|6���C�Lo���8������{ ��?�jA>��q�������dq�C��,$�n�R6��\�Q��+w:���K#�.���bC,s��aP�@t�K�=���(78���^Z,$�N>I-zx%�����]�^�{!�eP/tXTgg7�"�����E4C��}YX:`�@n�,@��2�S��a�I��%��>�K]����7���/���+�/��o�,�!"�.D@��GN��^E5��nT��~>��2���w|���,�"��.E�.[Y�"k���������c=��|(�>�ix���f�	P>{
+)!�O\�~,��<N�����:	-��:|X/���J��HY`���PN\��N/t��p(�i�9��Q��q��z|')"�K��I:~��HX���3e��A2�m��a���&
�'������v��B��zH.��&��z �Xwa�;	�N�wc K����9_R4��/���8�d�x��W���~��������t�����]	BO�.%.S����r�h{�;)1�?9U`���e���r��������)�$��L�������H���O�{(�j�]M������J�5��V�d���BV�k��w
�o�p�e{Kd�gl�R�%>;�")���/�����}p�i�Q]��dY��je�E�n?��1F��v�Q�J
<�I�)sh������@_Y��G� ��B[d���FnRp��9��2�L���={����N��������L0T���k�����
F\)�T��o�N���fqB���6�&T�H����JaUS3��/��[�J�N��G7�5��6����(��*R������/����u����;�_���0��n��b�Rb��!9'�X�~��x��g-wX������y�4S�!u;d��Z�i0?�d~��Q�Yo���8�����dx��w�������������M��`�J�\C�Z��~J���4kp��W��F�)(�(�V�����=I��a(��vc���N�z��?�?�o����L��|bf,o��6�
5)�-{�@Ux1��C�Tk��$�����i1�eUl~����-�Y�R�T�C�N���j�����wwB�H���x����H?�>Z&wqmPO����O^���r��+�xi(�����O��I�ES�-���w� +)
�(dE�����8�y�,+���(��5<Ck����)5�9]��/��9Ixdw��P������T���E3)0@|B�B�������l�8x���{�h��'��/�i;'=��%��~<M"I���j���;�T�*u���
���X�����]��e���'�����Ub���74�o
��ag"i:�p�	��+<�M�8�-�(X	��-�7��-�~r<��u�nL2��Oe7��s�c�
��@U�n�����j�7�#x�dB��1f�E���R!�S��Y�����t���7�����.��!��b���P�������J��4�bd��]&�X6���t�r|��q���z��?1��VfU�"9�h����O���b��=�^C�}��� q��>�/��C��y����a�����o�)��~*�p.�#�VUk-��	����x��|���|���P���y�#�>�����w4�K�}�xCc�I�=�=�Xg��yHq�'����u��~Y��(�s|V�5��T���.�~��M%�x��-S�����F#C��������uE��~�$��>-����Yp0�W�a�I������e�������U�w��kH"t6���4p�t?N�����x��2��V�*�i2�k�$�CD$���~�m�e�����vPR�}���,,��D8��a��h��\���"��1�>a�M�B����J���zs�����x=YN84A�=��2���q����}|LA94�zQ�����[�~�m���	n|�>:ES���>~��c��3� �>[�?���>�%JG �)W�4���p���Do�,��F���i��[��g����c*U���D������O�3;��n����j~��4�C��J������i��f;��>8���p��"xp�	)O��h���l�����R���M����{.��ock�f����FT'�HA��hi�A#3�&G�l�C�
R�s	�*k7��+����
����`����A�
V%�3����>{J�U�p��v����z2e�)@�~I,3��1��@EIzh�q >(��������2�>��`��f�C	��!.��,p����W�X������l[�K�N����EH��e���;�Y�����q��zLO���K��j;��`W	��:]������"c�!K�-m'����n�����U��p��f����O�����9�H����c�I�}���p7=g�u��J�!X4��G�L�����v������J��R�����n�����>j)�?���e� [%Z�5�?H�V����t�NE�����'�:KDg��S�������*���P�T�Q�up(�2���gB��jev���9oZ�S��p���}_��:��X�|������i���)6�lc]��a�Y�%�H"���d*��@�pl����s�|zs��g�����sI�N�,�m��T=Ius�����y(l����P��U����@L�9�G_�C�X��^������
�O`9A��"c�BJ�	�����XY�]����e�������+@���k��r������y���Y;���������,~���#��`�+��y��x���;�<$�Rv�:���
"�w9m�kMf�G}/�����������t�LS���O���>����7{�iH��Z��^6�Ut���������UQ���9�0x��GX�����fU"���0� 0^�1=�.$�	���~�:�'�"�����e,��L� �?���v�v%�5p��2�3�eX��@����'���.�����Y��S���(�)i��%�&�f��7v���b���Lps�(S�������y<-YdQ�:�QZ�(Q\��R����ap�����,h�������#d�-��_|���rS�K���Q�v)��(h&��Z��}�U������P	)�i6QO�l��!�G��O�+����K�I�c�����6a�(}H{���|d����4����s��|������Wg�Z�%�Z�@L�8�G��"�"-D
�"��
�a)�DlI��F�S	�x���r�H#Z�-��x�N��t=�r�>���y�Fy��bL����=�4���|���<�:HC����L�cRn��]�d�i��9"���0��E�r���aK2�N{m�{8��W�|�����F��#�MP.��Uk�|���a���.�	3�
]����q�n���x�uF���?�1B��J�&�p��x
i�
�bue�U�p���4t�%�x��M�����X	G��-�9��������p��X|\��"f�},��p�]]��,�g���*�r{�
��%���|����F�mSf��H�h��H�%?���x������=��g~�~m����(�_�0�t���L�������V�����P�]g>����I���r�w����H�&���!?�?_B�i;"ET�E�h�D�Z����-g��PG�%/����	��v�;������Cm�y���6~��d��*�h����T���9v~a�����������"X`>���d���`�DW��D��zOO*|�RmN�U��8$����u��zc,��u���P��P�	�s�:W��%��sE�u�mm	}���G.\����g��?�`V:���]8�^x1��Yd�,~��#x�y���Q-�$,	O�����^��8�#6��._ {f�2E^�q�9�}�K�>���i@���+*��n������Se�j���]S�<��A�E
�)���[Pc�������$S5~LB V1�t�g���p�bw�q��h�6�~����I�8\0N��9voFZ�x���
��v����d���,�'���Z���^w��M��6���d�[�����e�;���$G���0{9��D������
%]��.�,~�8�����x>��5��x���8��zk)`�������9��h�����#���(����oZ����B�����l�������Gp'���(-L��k�Zxr	��nm!9M��P����p�#���mc�j�,����:f�a��s�Y~.��a�e�p�F��n���I�����^�{��F�]�����f%�_C'r�k[oA�������tQ��y�v���rzudG]��q!�����������������#
��S���[���p�
��'�=�n���w�j��p�#?������9�q���t\�Q����?�������4�nD�r����
��)�#���;A(�g�{�Z�������p���f��.(X��V�	,#���5nF�@2z`�}D���#n��Uen0A���\�d�_g��ss���� 	��4��a���a���ia\d��\l^h��+4��'	'��BF���^Sm�%cc�������g����v��.)D\�\`���Nk��4�"�W1�;>�Z����8�_��bn\K����u!����`�����2�B��Dg��)
�����m�Z��cYr�C�&����t�9W�\�D���WQ������>
�����e���U�	��b=L!��x����^O������������Qe7n�u�.�2��i�+�+�<}����;������/+�a�B��sSX���d1�:-����q�u$��0��]��9������A�ya��!'����������TpF����zr��dY0���+��K��X+~D����e!�p�^%�C���9�����zu)�w<�#����X�_x��W8�>41�s��n�.�:&�#�^�%>����N]u���[bWu���U�
�d!;���]�������)���z�B�a�����&-zF�^SR�h%��/��b;c�9�>do�RN����x�z����������A��t�}!�B}� �3g����b5�����(?~����l�~_�����S����ETQf������(�N�[���NJ���Do@��\b��za0�������l��o����4~L}��L���Q����L�M������S����jgQ�K���S<��� s�ay'�C����k+��m�)(������E�����Z�8���eu~"p�����Z�6�����������J���Q��K��{�0����c�T1#�&�>b��k;(qa�����Y���=/�O\��H���\
����>�Z������	���B��V�+"�4A�����dn�_��J���4��;2�P1}1�������Q�9`���r�#��y)3�]����c��e^���\za�����fZM������k:������$����h������3�9��`��{�t���XS���X�$��h����r�Q�79.tzvV1�?�1��)��
Gv9�"u��}�sL���5?�3����?���>���������?�M����$/��	h�1PO����,�*�A!��9�_�wk0����;�8q�W���j3!������7=1}�P��O�a������\�wyu��Hpq�����m�1��F�!@hH��R�k����df���wa��/F��o?/��v�h�%�tFu���x�e�����}�k"�c�6��lpZ!Pe
y�-�U������k��~��$��MU%8y��OMX�k���X�<l)��������%Zb#��w8!��&���/��B��0����������i`�����4K�Lh��#�`�|������+������eo������'�D��]�5B��N��!�{���ip�Vg~�M���A�g��V�Z��k*�?����L����*����o������;1c��V����!�G3,r���
p~d��E��zZ���3S�����`\�H��!��(�=3�<QW������i9X%pc�b��?tjP[ {"i��e�����zA46�C��0wep9��m�\%�Z��k�/���P����!�(�������	p�mF2{8�|���I��������EK}ujZ������qw��^�
��K���G^������,��O�����34��gh�{4��1�yC�d+m<O�x�B��4���b�fn���^�cPA��W3��	f�
��>5�PE2�:=Rm��[O1V�4R�P��������;��X���*p���C�u����q����+^�
�W���9�3ky�
�����S,�d��q����@z��wT%�q~:!�	�����(Q�& rc<����I;��1�`�.6r��g��)���'_�A4�z&��Q5O����"��;#4mJ��J�X4J��%y~Z����,2��
mU���O����#�qbFU�f[�����N�:-F���55�:������������7
oa(�K� �l�����j�B�&��k���{>J����C����Ag��.��L���������h�R���8N8oh�f���qbl��g(A�a?9�b�0�r	R����P���,�W��m��O��<b�K������)��`�7�i�{n�j�s�l*83��6����i�8Agz�u|"���-\^H���(�*�|�>�:#�&UQ	�N*��@����eW���b���}�T���H�~���8�/����y����"����>����7?8sV?3����%t����	���lW���g�h��'zN;�)�!�U���1�g�d��d[({��}tX�z�z��� ��NS��%1
#X��m�_G@�#��<
����/�d�����.��?-Xa�9\2�1�,0�N�b/HF��0�<����Q�q�G��w���������\4��8)�W&�p��pR� ��(�>C��-b���k�>Wu����r�S��rox����P,sJ�!N�[�e�6��m�����?p�#G��*��:�kp�,����w��d��0)�B�OGso
��kqw��f6�7:��kQ�N����w���dU��%� ����q��sQz�
j��^��( �@�������
1�i�>����M^cV��P�F	�{�~3���������jX�U�����E�M�qp���g���mQ`�������S�q\o���`n��k����iNr�q���9�F��nu����F���o�M�.M6����8�a>1%���k�k3wGe`��2���BoB���M�3�qqV�3"G��}?��Q��G����4��<=���]it\^�	#CC�����Q�~`�9�/E�E���+HE�M�n�E���z#}���~�G+<7z��|u|�`��9���S��F"{<RY��j*{��&�)?��Ty���$i?@�1[���'�
<j������^`����N�Q��x*ya��c�fO��+&Fu4���y��O4�{���_=��1�X
 (�5�	����Z���\���� ��Z^`K�����z�r9�8�*����5*������6
�e����C������I���I��<x��I2�f���������8s:'�B�A�D����� YK���e�����jlU{2�����
v.]Inznv�X���t/�}��J86������U�8�����a�%�;��Y���!�"�M%�
��,�r��O	�-�v�M�BK�u�~��0|��./��9������,�i�o�6��%�[|�9��PF
MRn�i��S�	6��������^Wt�>{�y�����c��I���1����E
Z[x�=���u��r���G���~���L
��M
�������%!	���hpzk��^	p�]��(@t��@�����X����;S�2��t�>\&w�,D@��>��,��Z���N�^��=.��z����B����������6�e���)�po��5p�H}G04��kE2B8�B�b��E^�Z�}���^�>��$�\�AP��R��
��2��]��1��
I��R^�������G�#�PQ��f~�	��A����|��V�4^�.��+���}/Mf������-�Y�s���3�������_��;�M��O�~���������!-�M���c�|	����*�:�H���?����J_H0�%z��HB����+5��Kv0�����A���
_s>���y�����g=�x�K����U/E���	�(�t{?�;�A�R(�S�.zNoQ/�����i�~p+`���R�m
���|�Yw��3�&4�+��I�������Z���{�PkA�G���f}$�'5#���-�k�G���,;o;_����v��c�������p��f;�+>���R;�����5���,nf�$c�:�#r����{L��7����9���<A"#����M9RwSwL9R��o�y��,�\�e$���[���Pl{j�B�B
URA'
�'�Rk8������������BG����-F�<
&���`{w����u�� s��[g�
���*:&7��j&E��k
z&�z�2���9�#\�?��'3
�^���_~���Fui�x�J(C�����?���p����n=��=��R����F���V������j��)>i��H��#�h9�{���2AD #�yB�������D#/l�����MN�y�D���G��o����x����f:t�
��"�7-//��i"��oa���9�!<s��ubH{���PX����@�_
-�[8�V�e���0�a0}����K�un����I-�p��D��-��G��5[L+RL��6��,�Q���~�!�K�k�#�-P��C.�%�Q�he�����\`�q]J+"K
_�8q���3������9G}�>�J�Z�4b$p�Y�c�4��f��X�+O��� �o�>�I��W9:'�����r���!r<���O5E�)P�8�M�-4�B�&�O?u-�B�!$ol���u�f�������d���J~5k�����
�<,�	v�����;���J� `l�<~�j�![���H�v�o�����Z'z8������5���m!�>`JE�R����a�0��1��������?D��i;���a����Lt��>j�F\@����$���������)��N��"a<w����8�Bp�x=\��w�^����h�<`��kgz�L�1Z�����w�s�+���y���Mq���^E��~q�/��F�l�Dm����,Q�A�g�����a3#o��3P��rN�U��u9>9�j�+�{(�u�&������MI���aqjp�4�S�9��
�"?�����C��&�/�=��4�H.�Ka�#��M����?������+�
����s��g� ����~�.�#���=B._%>�`�$�?��@M���|1UGv��M'��x��%��y������n���?�O�\�"u8���{qU=/qx=�Q�=�/���6��������(���x=�
��g\|Hb�c�M��,�;qj<�N���,{��x�N����/�"���}g�v?�������H��|�k��5��#��:S�����E���]A�����9o>���N���fu�f�D3����7������X}��R?��������W��iL/�Ej�H�M���M��{^S�����G�S�����GR�?�-E�9$?������[@���W$��S��E�d��ooY@I������M�������U>r�E<����q��e{������d�f�Wl���V?�&�16m>�z�a`Y�KB"���W����|B�����z���8�SJDF�c�.g&-4�����s#�����C����&Zvu'��^4���>Q����s�b_�h�v���0�`��i4B�+@S_�o*������|�������>��
��m�Q��LB�Vr���;�z.�a1���=I��Sjl��SU
�G�!����[�x���E������>��)M�~p��k1��.S������!�,6��PZ����<�nhZF�����7�k����'���������TB�T��=��5
��*R�Y���y?�m�`����}�T^�������;(<�"�9�Yd�7�~|2�jPqj�|
\���,�MTI�E���l�����-�)�R��(��o{}�_7�����#*X�n-�0�������������`����;��7d�303~�v�����/u6G@�/{������
�)�m�!O*�D�� 4���1���?�G�������
������y:������&�T���$Y1����Y�lF4����wf\s��Z�=�	��>��]k��&o0�b�7*����������?��+��A-�?7�k�e����X�oO�����;Je�;>��ErOyZ$"5����q�������G��HJ��*qR��Ej����3V�������������"�	uP���_�����g����\iJ����	*]y�x�=������Jr~�E����-d
�XFVV�%>?O��b�"�zx������$��������-��)g)u���/n���z�gV���r���������O��pVS�NMX���RO���K�(����k�h�g��L!�����*�_8���s�;�7��� 7��FX{����q+z���A��S������u�4~�
,���������
>��{?M4&���#��	�g�4�s����8�vf�o���B2R����[���1G��d��k���[�O���Qq���?o>u����b<�_����j��P���G�Y=�9�D���{g���d��
�6y]�m���p�O���uOg���L�M��������h-gx�z#9 H�n%R�����r�N�[�_G�F�<gy�7}>���Tn���~���R�	@a3���N���Qi�g�j��nz2�����:����^�������7s�'��
I7aR�a�w%�i��k������E���w�}��`�NT��	��q�Z�Oz��*m�i$���v��/��F��#/��=��/"�x4��?L����,�S��Wa��b�8�����`of��N����,�@��������Q����
?v��Nt�2of�?����������X�i�e�T$��dd��;���H~A�
��l�����Dw
t�Y���<��g��@�G�}�#h����������zx�T(U���-GZ�����`�v5.���-�`;�qo�)c���g��nT�]�
��4u�����%��i�����g���wi��o�_�&��">�-93�Ufx��^H���~!����N+�oi�9����s��$>P��vA�QJ�����5�1&�\y����$�~%|��F�Q;�,/��J"z_x��Pt�!��4����m���_������������o<����U���)�EI���_h;�KF��87z�q��|����������?�����X�v��{=cON��r5#2������L�8C/��5�}v�4~���P@3�
x�
�4��cR�z��r��.�:���9�Y�$����p�^���-*��w��&?�g>o�7G��g"��
--Z>�����Ol>�8��B������������-��Su��b������$fb?j���>�$�����'�G`�Wf�S��4��D�w�7�WT>`���m24u�o�7o'{�}�����G%�
����Eb0���Ga����
�PZ�_&w3o�@G��
�d��J���B�#"�����|��|������*/��F��
�ci��>0������:
E��"F2��t|'D�
j�"��]1�:����,�+��M���\�B&_�KL�q����}0����%2>��T���������,S���e��M#���=0:^�-�.�M|T|%��9�n��V����=�|�'V%uC��I�������}����`��nsP���s�st��-0�?���B�9O�5;��OwY��tl�;�7�(\�}������C������[
!YKq����p+��~ �_��
��f���4
�_J b���	������]���k�������q�i��Ii����;DJ�NA�.��a@D��E:e����{�s��y>��{�����sb���^k}��k���?����A��x��g�b1 �
\ ���a�?�!�#M>�/�Obhn�4?!P��E~�W~�	�C�O���g7aG��8z�+�����)k
G:�[�Dl4���}��}?��Pg5���|�#���wS����x�z��w�};�_���_)X��'������l�������ou������[����*:�mV�h_A�I���5F���$�~��7m�p�'����8
��|�[(@5���-���&
���6�<��.�/L������7�>��
��o��!�������[����1��������#0k8���k�5�
^���	�Q���?b?�F�����H�
|��;�Y	B}����.�G��c:���0/����[=>�aR�
<,��*��7<`~����o'����-��b`�I�/������`��W+��I�%(��EOO�A6���@l�WI�s%�.�d�~���4_s�#{�����������g;��H�`�z`��3C��bm��}TFr�g �@��fp������b�cn�6^���L��U%�}�RH�l}`S�g9�����./�m����1�������8��l�6f�X��L}&�z���**��DL�W>����W�����[���$��H����������S�w�����'��4��q��6:��r�?����<��`�D��
U�t.w_���_����Y�������_��#NcS@�
�O|��G4�9�
u'�~�-���hC���^�[���k.�e*�1�Y����9����PL���[�uA��$��Sr���GSV�7:��c;���[�]���@���-�M�������b����i����Lv_��n^�i�J�eC��G���Q���\f!*,�eP�V���NR�b�L���#0��D}O�9&�����C"y�`��g�1�)�q?�������� i��ti�m�}fQ
c�S4��,�3=t�B��G�	��3J�s����|����<����|*�b������H�2�J���������9����c���~e������^17�=�^�"'g3��w}	�x�����4�d<���;�X����D�|��h��n�nHW�8.^��������J������Y����@�<��n���9CC���c���0I��Tr�w��R�&DV�=�JZ����c��������>w�`�5M��
���,���3YA�Z��<�;�t�y�<�@	h:{G���������*u���*�D��1�����Ip�yg�BUM���L���~�y���;j�q���*~�0s�_F��k���/�f�Lz�e9�M�o[.�����h�$����P�u�
�Lp���?)�r`�aIA}z����TBo�{{4mP:��:Q8K�y�\�W����Z���������?�BYi�;������<�F��t3��rW�J�GD�a�<�H���c�Wf"������>	h����6G�������\�(n�)�^A���co	��u���qz�+������C���[~��[}�c�C�R���>�GoD�����~f��R��~EDU�^e�h]���	�����N�x��].�	�b���CF��l=����������M�����]��m	+P��AF���kj���I�%�y_���&���A��q��K�%s���@�t��py�|J�/5�����;�Yr����8���zl,�{%���>;V�&x�b����������� ��+!);�j�*[�h�AW�Pj�%��T�~�x�rV7��������;Z�-�D�Q@�<��'�a��������+�g�-���)�j�w�2X����DI	�d��D���_����L���l���fgP`�<C+�~������d��������3����)tbewr&nC"�^"��	�����Do�:���5
��!���.C� c|f�����*03i��qE�2}"���
�{�*'�3�K/��J��G�R���&?�vA%���f�n�\���d�����S�GUr"�!Rv�A��|FAy�E�~�K�����73"�?E�!�h�8g�6�W8gg$.���/
[iuO}�zFlh��#\�'�����E��E�����3��o����"|�Q?P]��^{�d)m�o�Po/�-�t�Z����M�Q����vt�n��$�I�g���k�[��1�#����)�P��V*��?�E;><9��/S.J��UX��yeA�1Z!9Z��������i�	���BX��N".���|��G=���}�L��	a��z���f�>����DG�:B������^
��Ll�f'��a��P��w8n��.��?��U�<��v��OrO�Fx[{�?����1.��lC���]�u�+���o_�����R�3��k�&�q�����u\L�y2�AO��qa�!��k���a�mf������uO����bi']�	?��#�X��g��c�������,��T�f�m��8���U�@������#���L�?�!��X��m��}����Hf����89'����'��+
x����(MF��,]�OF|}�|F��35xz Z�32�D��lh�nm�+T6���T,:���Tj�3@g.j�q����J&P�,��)b�/���p	�8j�b�[�-��
�j6b�VY�����=����B���[�d��:�}�����4��5��_�!��-=�3x|������DQ��2���2��`���u��!fj�����[zKE��
��'�Q����R��,�V�@������T���&h�}��SX��M�}�@d�)�e����2C�{5	T�H���X-��9�������mp�������CZ,'�����Ly&1�#W�f��+�V�N��Gy��V8e`^��s]�����mx�1^-�[��,����f�<r6Vn-�9Q>�
��h��E�T�u�
G��G�8���>����\�s�{v��Lc�Wnh�3����s����5�(��U%S8"�:]�Q8��;}��`)K���D�N2-��@u�{��?��C�E�����<�G��I��wM�hT�zW���;��n�l������)*z����pdt�O&�3�"��c�JK���]S?����u�Vy��N�/��]����hZ�G�k����~�����x0��Dv���$�@+�39f���V�����?b�
����:����$�{{'�C1����p������*�����|��5�:.�1�
%]���/S���t7E��������%A������\��in�>G,����~Ch~�?r#3�a�\W[������e��r �-������x��.p�P>kL��O�1��dM$l��	b@ �[o�A��,��T
]z��q�:_�������w�
�i��.�
_Fw}��t����������"���o8]��Q:�h�����y����t�vtJ*��K�Oq=Q���m����5v���3��>^
����Z��,>�� �_����{%�v�"E���4������Q������o����&(��(I������������	)I�H�������:��~xC��@�OM@�J��@�t������Q�_tWr�!SR<t�V3��c�	���;����	���E��zVC�WQ�u�J�����ks�� K�m\�z��Q�{.)�b�����`4z��H�v��m�7|s��������s�{��v�]�t����f+�,u�3�������Z��1"^�7TJ���2u\����-��D��Q� mM%�8u�B�i���!�RD�����Y���@�[|	��D#H"?�����7������4a���WL',�s��<>X���e�EG�����c���
���U������ ��K��8�1�9NV���L���B����hW�`���������9]���"@V&Ld�G+��H{�a-�.�#M������lz�������{�V������Z@��?��J�d��<�.~#6�I���lG||	���V���F���	�&�����rX����@���q����j����*>g�f�]�V�Xm�@��{�L'	--�9������~q{+_��:~,��Q����?��nT���	������rQ&������N��C��$��������L���O[���B���2$:�g{#\����I?�zk�;7`�W��cx7:�i��iL����bW��g<���}[ �$R�r��
|�[NHeK,.6��y�/s�3����\g�8�}0�C2d+��)~|�7CQ?�	 =)
]���dR��Q���hsv�������t�DO�����Ai�b^�G�w��.�l�6�d�6@��\3�Ej'�w|%$2��>-E5�>����r0�i5Zi+���6=�Pl<����3s��x���x��\CYQ�����d�;��}�"o��>�^�h�NIk�V� ������yD�"I�'0Y��/`�7!�}�~C�`-��w�tf�8
��V*��S9Z��wP_#u�V������A�@p�T�=���7���L6�h ��Y���t��1�	����x�kE������H��%r�A/�[����o�����:'C�=X}h�<}��C���:��9���5��z����������w"%[�����ToWx�F��0��
*����u��Xd����BJU��:�Y���#�R	�l�peVoN���~�K����g�t�)��O������ �Z������6��H|����.�`s���%by,A��VI.C"��
���A��']w�:�A|#���m9l{#Y�)�=����pL�PX9C6�]8�������!��%����������~��x[�����<u����-��jO����b�y4������ ��hA�(������������� �N_+�Q����^��A~Yq��_L��m6��j?&�2D��&��3@��~�Y��B�n,g_���R�����W�������-��8#����qg|=�����s��sJN�^�@�|#h! �sVx�Y�u�\9.���������>G�SL�P��������S.���5K���!��v.�L�����D/��WVX��]�qFz�5��ey:��$'��F����z1�k����uf�K'U����-twq������5����i�A4?���hz*d��5�/������?�q�������-�����?��[�6.��!��%IM��8��1'������ub���}1[�ou	����c��R�<�d��lv�6E|��2Q��K����>�qQ���{X)��E��������V�X~6���G:������`"��`�a�-��#m����������	�u��y�����
 �wv�}0��9��|�%7�B�3��N����k�,�^/%�M�5���O� �0���)�1T=r'�r8���@e����=���%��b_�t�� {z-�]Fs�Y��������=&�\�|���|��'�5���R=c=��-2� �K�
�k�|�W��L��+�I����J���f
��~�}u������l�H����@���KT��A��7(��J.�!�~���m�s���~ut�_������S/��Pr��ziV��I�'r�A�o]�$o����2�Z�z��z��l���a���������u����Rj��c������QH�����=2?��0��qut���5��U��h�>*����y�{<A��${�������9�o��z�����p��w4�N�|~q�6e<�p���'�>?���;{!t��������1�bx���y���)��[��3��*b�0�%��[�3'��������W�5�TO��	��	�?�l�9x����7��|M�Y`1��3��[���d^�x��A8b5�`���##G�%�m�����qr�R���1L��P�X�����]t~z��zHM"�,"	1�O�rS�V+�I���T�V�t{39�������b�����=
,-�^��I����Ej!i��4����k�H#���@T{�J���M�c�Y��9���k����^�7���G�5
�f���&��4��������>�d.��_�R�g��<8���4�	�F��
r��F��`r@/Ij���T,����)S��!��rX�G7���\�3ilWs@�3�DbW?�]T�KfDE�[fU����zVSz�[}<mF�7�R�hbS}�m�L� <��0h�0f�����G��.�p�$�jH�t�����k(H`<a����H9�4� ��p�N��J%*��HF��� ����+�����kJ��dH�XR�:���d��?g]W�>�"������H1�3�D�`�>C*��:��~C�_�\(��p��=(u�*>���p|a8��2�p�p���,z�bz��8���HS/o4����v�M`�Af$b�m�H��l���xT�d��e�-����[Y���zV����Oy��v��v����,I�~������v�����������f�d�����d>!���z���pn���{��x0	�#��]V���z)1D��N)��(��h7�:���&����~�Hx�r�H�\�I$,�|M���D��aB_g��J!0LR�Z�^���B�44&�MM�7�R��,����]������KC���;�h�hQ�}�����#�0���S5����{��n��sbX'|��KxO^.���b���P�E�5]��%.lG ���j	b�~������:����~�|���?fu]<3#��u�	�iz6�S)l'%��������*���_�~��3l�3��O���z�m�BM ��[��E������./_	���W�	�I2�8��Ni����!��`�h������^���9[4���6�R��>�?�O�>�?�n��0L��5��	.��7T�zn�ev���[��ZquMj0�]������i�}N��m���Ae�2V��B���B���T�����}n��{�2��2�%z�I���S�<��z:�wG6���_��4��?��y����'�w��v\���/�"�AKN��{��7�M]��A7]�h�V�1w����:�<0�OT��'+�z��!��6u��O��Z���(4V��@�.����
����zkG����V������J�n�	�c��w��8t�����.U�I6��/<���]D��'/cm���$��T����g3egl6�$N���2���g�Y��Z�6�8w�5;�y��Q����f�py?��RM�ud�L:-��l�W��?
�rd=��v�q=�vb����8��)��o����
�(f�����~c�%�sE������jd5���Iv�;�eX���^dW*���/�����`�Az�>���}�;q�((��y������4�U�KAQ��S�Sp��V���M�
�R��ro���\�2�v5(s�o8&�h��w���|.��K��������q�X�q�`bj��JH���&�o��
o���*������e�W�0���1L�2H\�b���#�|�$�{,���
�L�\�0����v�=���Q�2���I�%�����9�����W����&�������*������R�$�\��kEm2#�@�v��f/,���J�s�"�"�3������"����>��'�
���@��������\�Y��m�Wl����:��{���.!�(���E��2H��H�ih|���n�E�����\7�s�M�=T��.�U]�����S����&��~��D�Q�����
�(��%�j��=��:���'M�0=<�i������&8����rP��|3@��1��A��4*��;��������A�F�r��m%n�h�tQ��+1�
C
tbIxv=
��FSb���c�37[Sb}�r�]!q�XVE��NEH��6�p�Ml_�S�c�����$����d�P*�~��� ��&��t9,�?,(�0K@��51%^�2��"�%c�(^��JF��##�Y�:G�]�k��O�]����9U��yr��RF4�%#�!��G����~&z���hu�om'C��h��m��s()��>����l(��j�S���N��S����hzqtR��Cj��)t���-On
et3N���4�������H7��%�:�>�3�]Jt�:k ]�|w� ��B2�����H8	�7<�{��X���e���08*��2GA{��D�n\��d
�4����z��Xs������>����oO�nh84[wnQ��/#��3�����,�\��U�z�V��vr�)�����3>[��~�"��������DC���<��RW�*B^�ZU�1j�7�M�1,-��������
o�tBM}���w��CP�e�kj`� B����==���\�X� ��!{>�5��&�W�C��c����W��U%O�F�Q�>h`���+6nJ��Y���"�*'��\t-�.��y��u�O���V|�c��c�9�x��W��M�"��C��������M�~��B����s���k��F�m� eF�,vsW�z��Q��P^���B_B��e��Ky��<Q 3E S�{b���b��b�L7Q�>lMi~��.�+75�R��Hm;�	�}!Z���B���>
��x"�}u��f]��=��"T</�I7k��}���]�)Vb��W	�X��i��n�G��M��4'���h%�3��.Dk�0���t����%�6��<Ep�z?,]�4?~<Y2���:J������/S5�$4�s���N�T�~��'���-���>E���6��,�
�G�'8�8<J��Y��;FjtN&�U1o��[|9����`�q�2������vIW�h<�!n�+Pe���+�l�����?�+.o�+Be��J�qwW@P;����
$�� s�G�9��qW�m�f�������dm�����k34�c�,Xm{?,�"b����2n��R�L`�a�tn���.M����>�_lO_a�P0y9�5��c����^d;�0�����[1���F
uE)�c���������J6md��9����iuq��s��������W�:��R�;�����
Gg�@�L}�U��Q���g�M�E^�`P,B�r�46��[C���n]YQ�����n�q������2a���L�f��������u�8��L�m��g��TQ'
0��*��������t#������J -����8�1v�2��T��	[��E�S��w��J�x�|eZp����)���
�0�[&�bT����Z7Z���q�>7��]>�5���DoTS��K?�=A���������+l���][���,�0�u.?\l��d�T�����;��=�������;�%v]F�s=�9�>�Je�<�T�a:[���R�5���gT��y���A�z�����`�L����e�5�_���(��4���v/���
�����lk�e����u
mv��#�e���pJ���	�Fy	p4e[����DW�v�B��}?�����"���.�tmd#���x��F|/��iV��>�+�����11}E1�	
H��4����[��JN��;vi���[���FQZ;P���>,���+x�k��H��������	�nkBU���QS�fO$z�$2o���y�M�]�^k�v�B�}
J���d���rI?)�u���i�&��l{�������s3R`0!�oMyEk���(�Q��,�WR/��{�����!��� O
s\F���	����-i{�LfK*��-5�j\�,�t���(8��rf��X�xz�x�w��1�dq�v��M��d��J+��!�^������6�_T�����B�bn�E����d8Wg�5���� -H����6��j��-9v�cZ6��%�)8E$���Dx���y�(��h��|����
v�r#�Y��'�c	_|��*������fK��_U&���[�������Va���(�Qf;[SV���}2!���k8z��U�s�-�Q�q��	�w�+�D�r�'��"K����|w�n����K���Y�d����<w�c����t�u2��E�6��HT�{�l����������	��n�h����k6���P��GBUf�/R���SY�ee_.�������h���?�1B�	�)R���t�T�����E�\Q}]a}��IJ�If����}�D`��1
��B6����O������e��y�8����R��ig^v�m��5j3���i����;���jT���>���<m	�m?�T��+�}��x�
r����t-�+�|�:>����6B�
��R}eo���#K�� �!�=���n��io������=L����x��y���V�BNO�p�
��
���5��K��.
�@I�"��j��O`+YC����aB&_����c�J�*�!2x�C(�Z$>���~4Q�rZc�}T��q.6C&g�*����4�F9�VR>�u���Mo�H���r��W������w�D��7'-p�3����U�B���jL�����v���O
0|����G�%
0a�sn���q�
�#fh6s�SoC{E��0��Wz�����*�Z���3|��l�+�-���h:&��<�h�N��6���p�����3Y���i�`,�v��{�KY���l�O�-z���	���s%�^R
d>���]����q�����S�Q����75��M�@�\�Y�����j�$J�� 3,��E��&����&14I.���<�����C��	l�� �!�
R8�i<�v8eo�d���9��~�kFo7rs�$c�s�j��t�ivNB��a�m])�L1VGL������������	2�;UW8��|2����^]�W=���:9�6�:td��_(�_�?O 6�v�S�`� �-���
���g"k�Z�=G�����8�C[�Bb��
�;���Wn��3��r��/��q�OUz}� &�����+�0�X[h�!4W3�	�?` @u��$����$���m�
��q�]�k�.DK�m��]�{lm��������L�S-����^�I�i<�Z��_�&�cH����\C��X�.������'WAU)�F29��8&"�FN������2�������&��SV���r�e�R�P%���PYe�{O��f��o��n�@jJ�$��m������\O`q�tX��f�	���2�r��Zs�H�l�AkLGjwcQ^�5����_�:z����!�|ZS#�45]��B^�R���j��wt�z��#wrZ:R���m=dj�����W��d����T�i������"'BS9�`��+�U��!�M��Y���@~��x��~������M�eE
���m��)��-�6A���tc������h�zz=*\�d��kt�pg�pG�T���C���?�7�D���^�P��8�����@�V�l!�iO�,~-����d�m��b#1�F�;�!@�
"���x!�Y�^��Zqe����?��""���A���^�}���ib�k��P�����-�A�?���{�=ZGF
�cr�������)0��x3n�ll����x���x�����s��YEY�v�1v����X|]������Ng��
R�|�~V� ��l�����~��o�=`�,a�������G��0Nk�[�I�:�\�_,3�Q�@.�Yl��s������
����G���2�-~���\�1����9p���#2�fluB��R�p� t��{������	[�')+U�Ey��rF�H��]K�������eBY���:NS�aMz��B{��mM����=E�"��[)�W��&O����>��s�[~P�������Y�1%g��p�������!
Ua]���
SB�U�
�>�p�������\.���4%����D��X��j�'�(��?p��G:��~�\�d��WiJ��?������~�����d]�7�dJ����X`'����X�����W�x7��7����]�,��V��]��LA��e�z�D�9�C��P$�W�������@��(d
��(�C���e@T\�r�T9����2�5�#�7{�F�BbJu��8/g�����j}��w�Iq��[
;)#����&���5l���/N��� �S�h3��r0a�,B���qI�9��z{����m��u����'����������Y������P06X��*w������<�q�6fC��jZ��.��������4����N��6��{�(p��u�$���(L������~y�Jp.7�	�?�u���l�p�T��D�W�����[]���w�s����&%��&��xM^_�S4���'?_.���J�����0�����vDh��L�"��]������t�x��~G��_���8����JN�����6#�����W^�Z����G2������e�u �����rc�{\a��=7)}��['P3��r
�\m��o�q�U^�4Dt�Ta�y�5��Wx��7w@����La5�x.Q.B#<�����e�;��r3q��t�����74Y�B�"h���5R�x��CB��ux�
f�W�+~G����&;zz(4�"�F��l��P��y�������+�H�,����K���)������fd���#�����q�3W|cfQRH�e�o�������-�1���)>���
J�5�*�}
G.���6���z45�U^�E^:����;C�PXw��l�K�q����#�~��tB�D�F��=��������L�dcV�������������!F&�N�t�s����&E^��8a����n*?���\�K8���~R^��9\u�����l?���u�W�n���<8^l���,b1�6�����jK�u��9s�tV�������3��$aw����S|��B�5u�����*c�j��wC���YF��!*�y��[���(�52u��9��d6��5u)�������\��}��#H6�����e���
~�2co^�p�7��C)v_�;K�,���|��w��y!hrK��,�P��n�^��Q�FSY=�3uy��I(��NE�/��]���%{�������a�*���S������&;P�
O��1��W�U��[�����PZ���m����r������v%��=Z�H?�������/n�^�	�G{BL�b��pl,m	�n�	T��
}j)S�7�����M��O@*�na��������4�~���
�V�klJo�#�r<^�_���3N�l�0mm|d8Z�I�a0w'�D�`�}r���%LF6�7������v��7R���-�������)|�v6Sv�K8r�����(,�7h`Q�?����r��J�T�d$)[�u�5�?>k/<���������1�#m���*`��27�+��!���P��Gn�����gY}xl��.���9h�.P$�gi,.�� 1�>��o��Wo���n���-���d���%���?��GZT='}�;/N
>L��?�z��`@�6��yN�������'}^��X~��9<��>�M�L�Rqx��p�S(��p�&J'��d��N'�s}�s��Y%u�BD�� ����� ���������j�l]������}�$c/��A������B��B��5���A'�7�F[kWM���f]��
B��.oU�v������C�1A?�:�sd�b��i	��OoV��a]��a;�I�������,��i���9�H�S�2C��i��G
��R�\��S��:���
w)�����1��p�"XA���2M�
����7������N�����R8�|�!5u����v�v��h�wsk���e���u�c1��������D�<Z8��A���&���3��gh1x�����UP�x��@�j@5}Z������aCF7np�����|B�(�qW�[]���S[e3se��x����8��25Cji�N�����0)����H�9d ��2�C~R�-..�i�m�4\�������d9���H~�6rS���M���Wp�n^�$����T<t2���s6����O��x��b�c�_1T�=��&<=�Nw�0�C���Ag{�FTi�<l�PA����-����{�-����&��q�}��h�G
z8�g����e\7�}=�i�Q�cQu�7
%Y2�����s�re�ZuM����$+�����a{|+�;xQ�q�+��������v*3�
_�^��m�XLO���X�#!�g����tf�e}����t��c����1�c��B�������,r(��:����Z���Y b?Z�p��4OhVU��C�7�|�,3b�mT��}iWh������~�X������������|�d��yn�BDkBJ���������QyQs�J�0)�tz�L��S��~�6�\6,����R�{U���~2X����v��~	���X�Ia�{����&qJ�8�����Y8s}#������h�|�V�0�?�ml��z�����'4�d�1O��t)Y�3�n*)���jQ6�8��L�w���=���S��z#Z9K�n���h�K~�BIA���E���o8���N{��������W5Y���M��7;�]mY^�/����\���J�� �dt{R����:����� ���|��v47��Z9��n��g���=�[A��]k���._�_�joK�^
�R�_�N�^�Qx�f\_uT�!�G�����v�.a�Wa�>�a�ye�W#����'K�h{��W�W��W�!W����OLy!}�W�QFE�FM�S�K���U|�&S���D���f����s
��f��R�Uw���.�:��p������2N�Z��;�^YvJ����Hh<��q������vU�����y1y�<wq����kv^���������,1����:�,��<����~By��x} �p�~mt��o��2]<��?�0�������j�\�o�8�����V����C?�Q���������1_��A�q��ze�i�U_��Nw���ex�M���������,��:b�If����b�L��pw���*�j��?@�oRP��&�nEC����2���6W3����w��s$�OE�<�;��h���7?�xy��p��:f��x�Q�*}s��[i���^��F�sl���u��������ps��#�?f�a��a����|�u�W�9'20�����S������pV�~�_��=������k=�s�g|�p�C�&�����k���&`�z��)V�=�j�~�g
�r��=��,O����hh�=f�,������_���l���E���K��}k��
z)_c(y������_���'�yR~���"����kCx��������^�m����K��KB]�k��GQWc���e�b��������~�o^_�5�\g5H�\ �$��5g�8NZ�V?���(K�(�"�2�r�n���z�1lW���t���Y_��~�>�n��Q�*�i�������~
��>T�]�S�:H�_�n,:��z��<j�z�����7nz`��`������{p,��>�(k���g�p~���pp������U�
Ql^��?1���.����K��~23�����'�yY�`��a��s�mW�>u�w��I�-�$O@����QaD>
�Nd�0����"stcZ����p�9��1-������I\ko�������x��cy{N��DO�+%&��>��iJ�Zy�e���fl�����%��m�uy(Cc�B	:=�!4����~���/��qO���l<�k��6���S�����������������,��`9�`�]������df��S�K�>���_g��
���_�k�����.�s�l���'���_(bNzxMv.8�7��w��\b[�M���~7/{�I���%�A'�j���6��uJ`o�����G�p�/4&����}l4�5u��L�Y�7�v��P6�0���;���
�]t)��8xH���>�"�F���A� Y5����&�c�c�
���?���.v@�>.���h��e�&1��laA7�S�I�l6}�
H@;�����qA&L"7�����@�?�$�b[�-�n0$@)�~76���wz�������h����l�|\�����B��(��� �+����i���9�\�eu�s�!������<�g�BNx�M�N���_�'i7F��hz�4����1���
���3���;��>dH"��zv3$�nC=;���P�#2@Cx�	Gk~������	�)����@����Sa�=�7X���?��4#U����<����*
��Yo�gz�Jr2�`MR��{k���l5R%*��f����Mc����}�u/i<*��e�ww%2��O������	�3��11F�]�=mR	o2`���{I6����{I�C������GAW�2M�h���` n�9I3��V���7����{
2q��2���5�C;Q���J�,�6my����$��{&��[���oqn	������� ���"�$oyhm"��	'������1y�uY�I�=��F����\8�u1�\��O������D��#h��a-��/�����}�d
���B���H�"	j��3�a[o[�W?]�Ij��5��[brT������T
-e��J������R����,e������^�� �D�g�b}n�k���c6\vO<�,�;0��%50�X^LM����2�����3�+a��
�����zP�}�z���s3�����e"�k
�:�?c�)cYe�V��H��<�1��/�>\d^�,)���]*�Z>�z09U����	��6�`~��7���; ��/�����B6S�g�BvY��&��\���L���Y��?}��������bi,F��+}j/��E�X�1�G-��6:��u?��j��*D�e�+S���{J��D����#�I�#S�������`��
;>P'6o~��v�+�O�q)��0�"���)U����N�����:��eJ=�1�]Lq����;bAhbA7W����M��q�sL.����2����Z�X�e�;��:	w�n���V�������AE�4�yy'���v�@��^I\����������y%Q��9���)�-C�
 ={��^����$�r���7zx?j�d��{����G,�4�5T����^�T$�������,2�h�� �,?m�,j���b)��@EM�0��|gq��y?�O�J�O)�Z����1��4�f��VY����<��M���shT)X.����01�~~�.��+�S�������<Mw�m��<,O����(������(��4�
��X^}�����������U�6����I���4I[�]���7��|�pL3i}���a�rc� \��<��e��%�������}�p�+��ud���x���IH.��b�-_�[Q��h��.�G��f�o[��G
�[BE��1�d�A�qo��A�GJ���.�u�|�G�R�D��r+�r|��5���w���pm}W&�����^����#'�������C�	;����	��H`�L��f�!�s-����	2~q.z~qj�y*�����	��[n�z�
�f���Q��i�����6J�Z��Ib�$G��B��"�����6p���`������A��>y�E\;\<"�X�$�C�;K��<BLe�S`��.��[v�N������l,��*�o�����{4�Jr�H���'��B�H���0
J�R����O�yb}��3���	8��>[p��Y+�khw����-~FRel�'j��w��n������lm�����;+l����kql�ORR��������3Q��K[�>M�g��}���Y��~�P�8iB}k�����G��?QjQf��q���O�08<}_�(7}'�{H8�p`o���������p|S��qaC��R|sj'Zkp������R���w�:��Jg�h���9�g0O�V������)�� o�O�=y�@X,^���GN��_"�C������Xm�����{v�!y�����<F�|�
����<<\a���$u��*C���w��B��ha�y�k<�@*1�A��qK��cD���|r��?]��<�Vk3t�[qsTs�X�q��$����� ��*t��$�W����Z$���@��5���b�Q��F)���o������}lu��z�q���J��r������U*�i�]
���z�2�f��{_�
m� �����HH���f�����=
o�D{��G��U��VX��i5��hH4Mz������#��DbB*�@�����T������I�$%g	~w>�UQ�ex+%��Cm��� ��T�������)����A��X�[X_K�B�@�
��<�����P���1����*��M�4�I��O�E�2�]��?�$�Y��|n�����=��}�D=��[0��*~F���������'@
0K��w��)����S�0�����18<��k'����b�)��:��1��1�=)h���W��&b�����q�Wa�}�8t�y+>f�[������L\�G����EM>(�4TxY�`xS&�T��	�.��4���S���<����I�e��4���������Ra�\���	���'�@-M�����NX�M��R�ib�}D�N@��Hc4yI��I���d�H��BS>K0Q�i��f�����9 �; ��|,���Z�=�zU�G�;����%A�d�����������Q��QB����+�)���FGel�[��<&�/���
���%%�����n�GR��S���3��?Q>4��mEd�)KZ��(�+���+O��_�_����$�Gp�(#��\8�����Q���E��[$y��,���`X�o�� 6�HA`*4)�U��>@���|��_Lx~������������0�	[�S��z��{�OH3���#�"?�	����E>O�QkI��gZ,�j�z,����yk4����eRA��3vtR�{w3����X������g���Q7�S��6����T���h�����_�G�E��lj�>-�n�jP�mNk<�������+x��Y��C��`l�r=�iAr�E��9����;��/������mm��E��7�����g/W��n�]��C`��9���F�Z���7F���GK��������/�w���UE�OV���3�[��w��gWW��;�����.w�3���Y�46}�a��YG�;;�������>�
R��������~�K������v������1�S���o|�'��4H�*������y�|�1�h.O�n:������������w$"��������_(baO�n�����b�Gy��z��%��Z����p^�p���������������~�B���������?W3����3���?y��R���������C����{��?�(�r�tq�[;�z8��w��W��!�((�h�=�t`8��M�W/���U�((�PPp��K�
x�����?��'�����3��<���l�]�L]l!`gA>+{������*��K������
�?���*�Nk'Kg����g�Pr�@��oT��A���Sq4[��?���?����[�[������J�.�.�����'����qtuD
��H��tN�[SkKg�'�.���|���_�m�0�B�o�f�lke����da��=�e������@O��teAAQ��|�#����/��~��#x� ����b�X�������L���5*�����o���[�-�������o���[�-�������o���[�-����c���f�h
sort-10000000.tgzapplication/x-compressed-tar; name=sort-10000000.tgzDownload
��������� ����������
1S?��A��u��VWk����PZ���E�0d��S�@����>������b�VKOi[���[��������q}���A7�������s�m��9�>���V�N�"O�g#����\*V)�z��[t9W�]{�����=t��C&'�K��F���27���N�K�D/"D��^�C�4])�{�4N|����F,��o�Pk���>�������F���q�b
��
���S�y��w���TEVC1&�3���YHp��5�X�+Q�u��W����J����j'�����LF3�@2Y�!��N2�{��ae%���X��=q!P���^F*�
'4sjr���lD|��P�"#��L6R;����7�����+�]��x3��Lk�+�C���; c��+�����<C�~��ln%y?(O
8���#�
5����Xa���b�_@�K����
@gM���P�_#�P&+�����y�'�����q"���"��iG�Y2�c��OMT���T��#x����i���>o,���@h�����.�~\*���R�V��hpUZ,�G�p�A�)�>8��M;.��������t��Rd��\�R�������]lh@���$T����X<���"#tW:S�Z���\����Q"Xm<�06ANL��(�|b����kB�y�D�	(���X���2�Z��rb]\�Q��R���`��qK�/	m��w5�����D���XG����@�^�v����;����M�rOk{0F����JxZ����8u�H,��)U{r`��D\%<�SfO��+5x�;��p!�d���� ���KL�0��
�{q����>Q��6B��F������_Ft LX�p@V�F6�E	�g_a����2�nr��O�1w<v\f���z�1�@�/�_v#�����x �������-�1�_�/����e"�D������_iS2����*���~���:y�?�b���[,�B���@�by4�b��O�R&�hE6}�8�����Dd}=����fO������>2�@���O�Z�O%<������g��Dyr�����quSd����W�SH��#��	�dR.���
X ��>���AoZ����M�:�rz~�

|�9���5��C]�X�0�#���r�0�(%�ta�b�1.N�o������Mk��>�'��em�����Gi�F���>R��y�T��5�vI������F�.��2����m�a��J��C����j��d��M��S���=)Um_)��x~����%�AqELK�p?1{��l���z��;������z�t�{bQ%I^�xn:�`#J��Tx��I�5q�h���@I����5������I�z�eH���3P<�5����C��y��M^�������pe��O4����
S�`*�0��j$i�����v�_�c�����M*h������k�j����4�d�T��1E�YC�7S����
 �A�X�"=���1N/@�a��P�9z&���x,�Uq��'���?��V�O��n�,a6����n�M�#;�a��nm�$�x'%�����^+�!P���8]�b�o�C��a%�	��Z��#��>���F��R����i~�:�UCo�	(ZX4,��Z,ZOu��~r��������4�w"�O� �|P^�1Ymbo�����
�	}�w/1bZD[�!�g{4��������
�&��)4�Q�-��]�-r��^�
��#z{��p���e
g�h�~<8`�#�Ryuj��cl��{������-������f�.�{�D6���
���TL.Q��:�.��m�@�@��b�#
/m)�����qW��G���Ah�����<�]��4��83o$P������62<�����i40Q"
W���3fR>����2�<�]:��`�.&(y���x0�>��>��M�Y���=e.�1��%�;h
����
a8RB
�e3�!�A(nS��I�11�;�����}�~��W�*gFJ��Z��	���4(��x�T��~��juW�yi.'@k���x3���������'���^~��?<�P"�e�(0���������&�'i��W1���������A��y������b��V
���j	C�
	���x^��Cj x~���8]��$3����6�x�����R�����e�W��E������s�����
�D:�Z�8N�t4s�
W����
�/�C]�Y
�Sc�l��N K��?(��=�"���K���&�
w
�S/�?'6|�y��h����+�������Fby^�2i��]��Oel`�$D���7���"����T%3���I��P����
��Q���7K��1�d�;rC~�e�����+?h5DV`��]����B��F,:N��^���������I��� G��`�>}w_�sL�>�D;z��g_-���x+�5�9�����	��9!�k����Om�b�����{��	V�6��2��l�T��?��'��F�<���=�m4���=!�(���b	���������Y�����O#��j��:�U����L#=F��'	�L����C��T��)��*�u�*`j��N�^j���e��S�XUrN)��7����Y�D�����mk�~��bb�w2��N����C��9��j��y�c�b�������g��L�`j�S�x~m�6�p��`b�}�"�M9_l�M���F��n���uOUS$��y��o�����P�!�S���5w�&��~�z7K��������\���6R��,	���+��
�#���y����N����V�H��O��PaECgV��e�u5���������5V�mI`L���r���c��A��{tw�=��i*��6���6a��D|MA&(��-�`�W�h(A��@][_E�N����I�($���&�c����2B��:�\){#�tU��a`o2|zX�As�x ���%O����6����"����du��{��=8Vw���>��������DPs����Uh�lD���#�4���	D�H�B�e<h��a��4�kE�iw�c�����l�o+���8��Z�����Iw�xG��'�@?C����2��+���j�C��r����$M���.n<ji�a�	������k���l��{]��|��G��x���r�qC]=�����?���G�R�S���if�@� ��RY�>x�����)�~~�����V�l���PV�n����i��������1�}A���T�������j�r��3��x)�����o����W�sX��&��@����>����
Y�s:?��{
�Q6����	+�|/�P`X#1$��������.��=�ik�Hc���
�!V�_��[��4vBx�K�~��H��A2�u�pW �G.����T?YI�y��e��P���o��o1'�s���{�j�N���AH����EkZ�z��3�Mx����j�Ag�xV�/�ar��7�����s�����v��xt��GJ�?K�����-K`[FC%���OL�4�X-fNX �A�r�vH'���C�o_��S�_;?�)pG���K�AX{�6�tR��\�Ld��Re^��*f�H*�����cxG��������p`�F����@��h<�e��9_{���kV��A��8���g6����J (�]��U4$s�!����.�}���|�������/��(���t��r]�9��������c���r}`o�c1�Y��N����_���+�8*s�A�a�Y��������E�)��H���t4�"k�E�����5xW@{��75�b8�pM�u��V��M��`�8W�^���b�e����2��a��$�S&��C�|�@r_��W�2�&0F��66(�P�cI�hi��[snxA���uU���,��$���W�O�_�<1���c��1��
	d���{�uYBc�Q���
(�(��*�t4��~��U��;��i��������aee��P�w���n��Z�,�Xk7��l-�o�����O/�#����b���j���'e.>��v[[J�p<���{|��J���Tt��4^�4���Z��s����*�C�^�Q���	3?�8���@\��n����;�(�0�P����X��mW��deW�;~�4���E�$h.���%����z3=��m����l�<o��Q��1�la���WO������0|�a�|3-{����)kg�kg���q�	Ek�n�.�!Of��O�Z(�"E�k���Gt��An)ie�{�
��J^1>h"F)y%�b��DWTH"W�d�5M`"���S�8~�C,��e`;3V|�m�q�[�N��
��k"m�YE��it���E�z���*��s&n�������^Q:
�`	l�?GN����hc��U:��|���}3����:Q�&_0K��
�0@L��U��K� ��w/�RMe;*���	�����
(#1�����DU�0J�t��T@Ou)ci-����:Un���6��6���;����31���V��a��;(���5�w��������`��>��PC��>��24[�6�F���@��������/��tW���C3��gz�,��Cn�I�N�������B��f9W��F0��V}�8�;���b7A�����UE}��c���Rdf������ �w8�$V�7��Q��Y�Y�^I.H�<�����G8��J�
�[������o�<|Ib��?�����6�����
{?����6%P� �z�8��������\Z�XIaZ����)l�1���B[�NT�>]���X"�_���F�6N��(Vwl/i`�bQE,;Pl=c��Z��*�.���D$F!��}��+�L�J���^��Q��MuiV���|i�'A�]`����"4��`"|
B��]��	��N��P��?f��4�t��#���Q%2q�/�����1�+�6:����=��*9���-T��r�$�8�
�����wO6k�vF��-�/;����.=r���w�@�,�@Y��Ux��:/�\���=��'M���H4,�SToG�3���+��Q��$��y�$�K��R������G��"�U����`��F���$,���Cj�D��C�*yhg������*�P�S��o�Zsz��D��:L�-��,�O�4�H���Vjn�������we?��z'��<��$�4��l���$���U�	�2����cgt���@F2V6%�W������y#h@&���2Lm35NP�7��"z�|�93�9{�M��|Oq)�9��J�}�����)��z���[K�F|�n�.�Wx�)����^��|�%�>� )xlp�[�!T�Xa\�$�f�
(��~�E���F�2�/�"��GB@ �X��aN���8��8���jz��g���e�rf�@�+jt�<�c=cV�w�,zb�M���2W��X���1c�AUhL3��W0�m{V�KEBcX���;�k��6J%�D����p�;0[���4���p���Y'+����[$�7���}��	�����PG�3��%�Y�KW�e�!��L��	_{���tU�%�M��F�T���/SM<�Q5�5�+�9'��L�A;`G������?V����� 5��.{�d�Nd�R����pX8����k�]�g���h��L)���OFW�]|���[S�E�V��=�+M_v���U��^������R.mwmz���
�S\n �3,���-�����L�"�����o���������A��t2�>��6���o�����v���I��`��v��+��D�CF�<�q'�9������[���%0*?�#\+�:8�y�Yy{<��h8#��/�����d��!A��v�(��)#2Y
Q�|%�-P��f��-����7&�DO�L$
�("6E�!�Q�����{w��6��y��9��3����������t9�6����$�MP.�mhc$�o$;@��Y��c$pwgDcCV��L�d3�1��3V�cZ���C�>��C�a�f%r���X���P�j)�
H����!C��$����Q!ig���5:����AX�T$�����x3{��O�hTrlyd��y����
��\B��m�>dP��5������-3%`17[<�X0����hQJ�����k����J8���@2Cr������&��k������"�x���]��&�+6���G�R���#Xb/(��������+�.C�a'`+��5��Y�
iz:��t��CAMH�y���D��P����a�n�#�Ps#���Y���bL��;8������p���Sy�8-����S}�4�-�n���5a�u��I�k)|��^�xn�M���e��SW����qd����S�O������Y�
�J����U�^A���[^L~w�6�Z��5��lC~�������Y��������|,����v���_�z|G3�;�V�x��m��"���{v�)W����V��tgC#1��OSU	c����)D���Q�*��Rs��.�r]�b��A���������������'el�w����8�%�N����/%Y9��1
]~J���I�KY_Y������j?��`j��e��@g���J\knzb�qY�]�%�'������p&����e�����.���NH_>�@��)Q��u�E��m��d,u3����o9��K\+%������ZR#U������W������T���Z���M���d��3G�3����_��jI�.���M���Z,D��]��jvN-�k�a�����+�N��L�'HJ6�.Gs#)�x�V=�`P,P���g6�@���%��%���]���6���5�6T�������@�'����bT=)J�u�U�A,���"�N������L����/@����Co�+}c���!����-�[Z�C������3���j��E;z� }�x�4
%�du����������\F
�u>���?"�69���erZ�c�2�V���"/�����
�|6h|���7n4�����t��8`m��O�����	num���t?03V:ugy]���������
`Lu
E	�f����~L>��*'�%1���1i����3����1���/��
J��uH������l�D7�0"������c8�`;KC'�05��Yq�q|���D*7n���1�;�)W?'�����Ujg8?�����#�)��`
��J�����D.��7������G��\v�wI4%z����w������D���������\} ������L�%���!�m��]X���wX��q	E����{b'�����_�	�i4O��P�'W�������]0q���X]+���g<��w
520�io�E�7A����J~��6@���Y��-�����gs�j�������p2�����Y�r�(�h���e�q?�I�`�<�|���`N�)a���3�htC�xk�k������b�
���&n�p���V�.��`������0!�����H��:*���������u[F���=��hG��;�����[�-�}PB�"�~)��|r�,��oh��X�,�����eF���G'v�v��j�����[���XMo�n�v��p�S�(o8���Q�����[,nAOS�W6��'�c'�E<(rC)=���pcp-���1��5T���
>�������d�X~/����c�Awcee�k�kt�&L��H���qX%�*�2�����}
�H����2�Q�!���t�g���4�Z�>�8��������X��������T%��)w��@N�����������A�� KI��+��	M���C��KN�������{��
ls(8�s�������oH-�����-sQ|�aMkaW4��<A�-<����������n������/N�-bM��<���Y�w|��8��7M�s��n��*�1��I=��:���$���/Y�n���g:��!��w���Gb���e�''3��V�q�;�r;��z�!����G��s�C��/�o��#��uz
�J�uje�=��B�/p�i�f�L��\�0�'D��]3Q����Y���:-���S����.+��/�#��������3)�u/��7�z��I5�;��8*_�UP�=�>]�u�/�w�����f	 J���]0@���)s�C�H�+�k~;SP+h��=������K���e!���b�U����������6AU2C��5�:�@K�~q���/Khb1���G�xv�Hp5;���Yng��{�v�NC����)���`f�x���[�����%�m���^��cT5���2���glk�H��7�P�=��P5��
��YxE'����a�)�"�s-�7���h��W'cp|��9)������������B�Q+�Y/���21�������vg����bJ�Bo��8]d�:�>�iV�u=B���P&�g�
oc�j�O ��N����2;b�����6R������.�1(��yK	Y��"��q�b�'����>�P��e$��107����Q��7�y��{~M�I�N�g�!�&��+��d^���xm����4���^��"^�/�������)��'<� q�Xu����D//��(�����Km��[�1;H3��u�Oe{O�*,�pV �����F��VxO�5���y��	/9����~�����g�,�8�-�_�y����3�V�qH�{oD������By������9������r����o����I���]]P����V+��������O���0��,n�+�bZ�cg��< j�{�J���O/%���g����r��#�W�cn`Q��2�^�I�>��/��})�s��}^-��F�������[�]��������^N���o�cL���W���/��qK�3�.��u�<���
#���)�����i�������YO$���$4��0�_r��g�����4������^�Re_C;c[.0z���4�z�{��4��VB�\��N)����
�Y��[�������{i��f�)��{�����Y�����uh��+{/�Dj����"�q8I�M�����BJ6b���	�uA|�P��A�6��	�m���=$'�o�p��s���uX�i�|"������=��<�$�v��@��b1AC�O�\>��K��H�7K�_�,���c�D���0��w�.��t��O��h!|��AJ[X6�
��|�QFa��(�j���HE$�_o;���*d�p.P��x�]��j�m	=�
�A�����7������Dav���D����`��X�u��iJ���
�N���je��M������0Fo*]�����	�����g�~����dLX�4~r<����T�5"�o���$��d���w
�" qJY7��.���w�� ��������`�����/j��".)���oV��Q���4�0�/��Q�M���fw~b�#/?�(��	�,�iPY��).���z!
O4�5��2^�c���y9�X:G���
O�b��y�= n'� ��pK8HY}�x�}?�s&���+\�D|��hQ�Ed�o:��	'Aw���F-���c���G�^>�P�w5�#�;0sw%���)<D��no;>�
�7�W�u�|U`��X�����4}@]J�{=
�h��h�`�����0�~�s
8��T���5}��O)���5?O�Z9��#��5���-��B���2����5~@�U�G�R�k����DT�AU�^s1��t|�6B�ey<��x����-���^_�&�b�C�tV��
m���Gn��tF#7..���p(=����[R$��B��\��uS��1�b\C�>}�>����h9T1}�T��J8�o��zl��w����P��������HRkY@��v G����Z�y5��Lwj`k� �ss~|��pv0@S?�R�YD� 6�x�%Y��'�&y{����o�������f:��VB��B���B3�(nC�������LU~���wS+8{�0�����y���K=���r�HK��?�t7�E�Xg�i�W��jH]i<Un
���2/�:��C���$����q�;��(�<8_pd��sE���i��y� ��
V6��PG��,�{�*x��7�����X���h �+jq^X{�Ej��,����t�06�,z]n�P����LY���a�g��������b�t�F���7Qh�������������$�]9zW��zH�:h����|���o3��N#���!.�����������<,�Y
��~z���������^����7F�uT�`���_��8����b1�����H�4 "��C�x:>��M��nv�<�Pw*$�c��X�e��y��*Z��E�c�P�;��[��'<������xa�^]�Q��O����)��1FN����CPm�`;�p�p�#�&�4/+�����j=�G�s�W�F�5���r�h���#�e4�@��A�
C�:�,�U�����5�����e���\�6��]��X:����D�����1�5��t@�1��$5���]w�AWSjH�<��w�},�����F�$Uy�s"0������d�!�={&1.�*��A?�a������$��0z��G��MR,p�3�����[uE��E�>7��.��,��TK�+<gHq��-0�:��ekb��~8�w�����bLl��������&eT�e�
$g1���]�]h��W�]������hG/n���-w@EQ�l���&t�t5 ��6M��M6n�P��\����Q����hl[Zn���K���cV��g$2�Z��<?n�xM4��q����-_%�aU���y3���`o������4~�i"�^��j�r�'j��,��a*��p>s6Bw�si�
�}�����d�Ykt�s��=m��7�XX�]���Fx���_sz�<i���'<wi��^� ��K�Xv8�
�:����+������>)��#��/
:�O�\�OK!���S6��;�������{p����{��*��z*�����R%�{K0�`��h�mb(2<~+4���z���e�oE�o���7�6eB)p��f�����kd ��klAZ��J#���M���������m!X>���8��E���l�!1
���H������%Z*5?�-k
��l�(��<�����5�B�
��d}g������DCQ�7�M���P�,]l�oV�PT����db7V��ql��	4��"5���M�'R���4;�6��|R���Go������Y{k!d�v�xi��Vy��������@x�D�7b���4^�� lrJ�s�k�k�x��~|�XU"[���+]�g�Q?�.�\������4�5�Z���]��hO|�!��o�:���;����B
Tn��5�e�4���\G��m�a��o����
B@E�e�2�=���&6��x9~_Cu����t^�
���t��<C�Z�u0�35�������UtY��^��[���B.)v��[�Q�"�u�u��&�I�n������������_vJ �����|�cmN.��T�!���%2E�9j�`�~z���#baB2"��I!#���Z�P����I]'a�_��������D���[\7����l�I�	f2�C�YNf��j<��^�Z�N�w�x)��;e�u��6!��R�4}�a�)�h<�����?����N��4����L�C�*������z2Q�R9.�I�4!�'���t��H�2z�,'�P���!��M��w��]��&4���f�$!����x�H�o)�7�48���71\����G�T���6t��������^�R?�O�nZ��8��Jdq�h=E�e�7�
����>�������GND��5_��e�"������6��z���9������eP��'��N�[GP�zC��5t�|��4��S���<Hs�\�:���B��o�-��������i��D�,�,X��7-�����xk�NV��Y3�w���l����8��=;��������7�#���Q'�b��f���NKzV/�kUci)���q���j5�v����;v���`Xgy�i��O���.�9n�DG�(K�������:z:���ck���}ea�� 
������������r\(�u�ZbD`�615Z��?�"yKM����p�S�}��4�8L�d=�'09��H���_��Y�qW�M8����\���Se��A�S.����y�	�����]��Y��3��������t"���L0�����Y_
7/���sN
�zz�
%��K������@�2	J�?���6-�Z��3�3LLVB�>�������TK��h���<�J-�1b�gcW%���&�x����H��=���	*�� �$�k�_�V���@z��F���)�����\YV��aR����j�r�������RnV��j��8��}&p���50��6�B��*��\%�-�Hb�
��}t�{�7I/_��g����s���|��Y�
�+��/V0�����r�Y/Z������vCYJ
B��M���f�Rn��g���\
^A��^��s'}`->��Fo���k�hm��-]J����2r�#[�PU�/�����_a	4><�����'��!
u�����QAR?T��Y�r���������|�)T��C��[��B�Sm��5����f8���M�?�Eo���t���������[Z��������F0x�@t����^�k���71��=B#��k��������{-p�p�;����9�P��-��h7����&�B��(�Z����E:M�Q�������]T�S�� +�M[k3�u�3}X9��8����Y0�YQG��u_�}�����W���0�/��rK���>���������l�,�����D��QGF|?)W%����z��/�vh�[8��������9/'�F��_Z��g�Yl���A-�:<��R��^-~G�t����A
sb���9Qm�l��.��P�g:�?y0H�,��T�X�81��������e�Qa!_:��������
��J�?�7�uY���J'��V*�p:nn��f�V��}�U5����.�c���'+����t��oL����E�D��_F���L��`
290�_��s!�jC���9'������ul���
n�Y>p;�26k
6��5�Pd!�*D`���f���3^�RS�&�n��-��&��x������(j8�{�������s����g"e9\_�$�Xa��_eV-�N9�$���g�6V�����D6@��/�
Z�� �K	���n�'��H<0�	�8���b/o���S��"������l��~(�������2�,B����x$�C���&Ii��L��F{�DPLT��(�����J);�Z\,Un���#%�2��W�C�t�|��3 �,��9����5&���/�O�d�E_r�1�X��/��e�P��rp�,r�O��bG�����Y
n�i��M
'/��a��Wpp����by���������k������p����a'��?4�n��
a�VL�c[��*������)���7yc��7��1���<k~ff�cz�w���{}�9T�@����_�`U���t+!5�!���q�*���u�9:`��
9�0�e�C��q��#;����L���O�����.��>���h{���c�6���f���2��0�D���P����b\v��0��G������W�O�;�7��j0��@m���2�u3`���	`�E&��I.��m�,`%MIuh,��RF��������Bs���'?3��6���9�T�%���6�l��y�� {X�t5���P��N�
u`N����y
����f�����d��e�������9L�#5�_��
n�����v�.�=���K���[���a9f��i��O
�����j6��o�AP����o����4~m�j�W�z��i�^Dx������r�w�,�ZA��C" ����f��z�n����)�����70��n�O��[���f�����w�q�o�=��P���
}y�m���1��>���z%��3\���& 
|q�Z�\SO�����o���%�5�x�@>8�'�����g��&�'\}
����?�v�i����r����{�=-����;/���5]�w�����}�E\6<�h�-�����M��S����~�0y'"*)�|�-�Mw8�2�:�S���@0�{���d��
9����'���=.���#�k���^���uA��'�]V�nCx�}����B��$�/�"�0g����;��*\y�������M�X�z�*��H��P���T)�q�4����^�����vo-�-��������G���2����|�K�(Pr��$9����B�yj�^DA��N��;	����h���T�a4���OF|J���	��������]&�����n��+�V��d���;�Vq���3�*a�Q�0��7��:o������+��?j��:@s�`���c���[�c��YL�$������{E��&l3�P��x����"�+S��k��t�k�����u����	�4Ym��@�7�q#���KP��3l��&3�.(;��GgE����7�o��bY��%G|w���2
����/K\/*(�#���e}�$��TXW�8]��J�n�����#�Nq�+�u?��]�f��E��P�(�f�	�]����h����c�]rd�{��4cc<_���w>���zq
���e
JT)���/�R���?���<���z���-�������|��}�q����q��:Z�
�����{�Ab(���4^
��nV_3o_^C���k9���GQd���W_����|\my�u��~���d
������;�x<�f7X) R���M���6�rG��o��f���t
5\�o�qH�����o�YK���������������S��?I���iT�\�zO�^{�?��`��=q����o�M����Z�-���X��pV���mJ����u��=,r�6��"7�X��������P�*Vk����-����V���#��`��] ���{1�t;��Z����@`�[c�����L`��1a��o��{��~u:os��]	e���;l���(g��-�4�������2������7\Cp�G>��67�_�;��z�H��#Vw�r"p�%�N�"n'�>rD���}�t
G���2p�������*�T�����e8Y(I)�OZ�x@�_��s+�W|�[��Z�7�g��c����
���� �����@�������!�?����x_�u?a����IP�
�|Q�-�v�=���:���O��-[�W@Z j[�|k7xlC!�*�Yl����b��_a�U���#N�G����������wiZ�@���+�3Ur�PqMZ���KQi�w����� �W��}!��1"[�@���x�nV�))���=���B�0�g���"��;���,7W%@�lK����0���_z���=�����9��^���%f?�����:����RWL[�4T���l:���q	!���������Os�T�[�C�|Ni�^#
~�	t]�?RA�siP4]�`����c��>�M&���v
7 �)<�
*�eA?#�E���0Hd^sU��#����"�6����5�������[��EP�_Q	_�w����d�����B#^�y��>�����C|�m�����s���WU�U��;K��~���*����������j1���G`sC�����t���oV��G2o9v(����u���������sZ����{f�l�������;��I�/&d��C���
���D�g1�o@��^�������s��&���b�NC����F.JA�W/�uP2�w�@���]����
�ls$(�e�1��'��cP&�`(�\C�{wo�c��d2�#�2&aK��2����� ?��U���R�>���
�;t���F����������
RD�(���8}����P
`k��W\E���kJc��x�������|����(�U�v�/|�yPbU-����F�;a��@w��_��*�s�q�,�m�����YK�L������(Jj�eC������g��������C����&`�t��C]��~x@qY�+) ��M�����?le�2����*���/����)�/�2P�<�C�Ro
������0�Nl��{D�4N+�����`��?�84��-�E���!���������q���Z`�����3���r�8A��S��ex3��M��y�\��%���-��"�S+8�(���x�S	J�B���|�=b<�����������������Q�4�"�$>_���xAP����f+^���o�a@y�-�����Z��H`�M�@�����J�����)��BJ����FJCC�)"�$y�
	8S����h��1i�����w��]����� �u��\�P�J�?�72\y/���AQ��C�A�������%\w������!��/��������y��\�����o��!�(���Y�/�b���x R
�&8D�{�����#}�@m��'�D�����i%����l}�+�?0;q�.��+S�N�)��g	��j
?��7),�����+��E�#�D�NzOP��`��oU@��P��v��*������������m�[k)������C���#K�H��_���>��6x���p�sDxE��!�<2"<�N�}h��yt��ty��a�������>4�������M����Jd����R�}��y������P����Hzx�`3~����x���%��no%��[WVl��[Z�p�?�M��*m�����`N���xA�VyOy��n3�U(���@��(`�l������7�)����I\B���A�X�����T{%���D�m��YsyxX�e���>�E�
��������i���-�[pT�i#��,\�wJ�@"�0af��<�(DP�
w����3�'��P����������?n|�����!k�����������]"�#�"�c�q�c��DkD�)�?{<�jJ��z��M��������N��������i�g?�]�K�G�n���~Z��K����lv�Zm�)��'�m�xu��&�]?���%���uR�*o��Jo�'9}&pO	�z�����
>��ek!e���f���f��Td��"�J��<����h&�h8���"7�]{�bY}&�9�_>p���O��
��/��n�H���������6��0m�d�Rz�2����`���?�k�z1��6!s�C��c�e����r������1�l�D�]�w�d'�uq��f>�[��F�>�0
���.�?��r���|-�u�����K���X�;v��uz��b������zp����JF�x������r�#�Z�zU�w|��C+{������A:t�\ziD�6�7 U�������  ��lp�O�2L[���D�d�����"PjWw����&_��	�=����M���EOg���?�����`���#A�%�f��-W��>�wM���3��Ir�40{<����r�{AC�EU�gu���YL��;^��2�h��AwR��*��^G��NX���j&��]��� j��/��~���+
�55����>7�;0dk��"f
����|��x�n�!%��U������y������0x��a��?����|>`No����2�X�E�5��9������Z��A�M����/}����*�v�.X�{4�8������n�q���K��Z�p"Zd{m��:���|�Tt*<����|1C�`�e����=�X'sC�*83t�Pe_�4��d�T[c�����%z�r�4�p&W��P�������e�����*uU��&�Hq58��4�t��@�����F&������p#*�4���ASc��-e�O"X p"g�e�����J�Gx�}�Py��,:��K��{�w���h�e��<�����_3}���n�o���#�v=�z��HiLQ�m��5M�Sa�q+�y�^Gc�z"����|�E�����T%�^<��J:��U����r)�-�m{�7�F]�����D/�L���/��w_��.���K�`Ce4u#�;0�:
=�6��R'(��@��s�.�����[��Zht6�5��"��������o^S��?&�J�n����:*��b\�@���}�{����(�}��l#3r!I�n�e�Z/#�a�4A��}e�Vi:u��)�7D�2/�
��6�m5.�4�r��� &6x�|�����*�	4������e{i#^T�)�HdUse>�f�8���%��2
��7��e�!b�p����pK7��;���#s�H������\�,%�A��T$���l~.~�10����Y_�e���e���e���@>8?��$XIVu�xk������J\2&�Q��v���������B-�wy����21�����
����68��[ �8t#�6�[�����qD�������~/�������`,�k����C!��h�<�o:���s1
$o�������8{�>J��;p	;J����n���`+-W��)Pw��V0���������-�o�7�s1x��o0�����Y��X0f$}��9��+����7n1�.�pq�t.]��]%�����AJ��z{�bT]]\o��`����
�v�6� <��S�Ks��e�AFk��{����HTn����l@��AkbQ���n�����4YO�)o�~�������
��T�Y!�,I���e�Vq6����O�\��s�O��D����n��h�Xo*/"&����iO�y�����z��6Bc���
��l��5�!���F��
��r����qQLgR.BP�Q(�������g�:�Q<Mg�e���<(�	P��IA�����r}�v��O�I'Z����lop�(��(�o�>��F����a�vj��E��p5�����8<_Z�f�X��	��AJ�i�k�.O
6h�@�e�SG	���p��NB,���&H�V�fC�	��WU�7���j�1_@Bzu3V!�e���v��&�^�l���t�r��3Q
h��P A%v�!IS���l�|�����n<�����F�����9�������-�no��xY���4��c[�uo���t���`�_����)3���l�c�!|���"N>�����o�d6�<N����h��/�CfO��mQ�QF������UoK����*�^Z���@$_R��[���r�emE��[J6eY�l�9�E�{<%#e��xF6�������<�����2*�0�
%7�A�7]����"�,�S�����EC�lC��a���j@W����7�����|z��O�h��\����v�9�60�.��A��L�OF�0���|�����q	��% "@����2�K<�
(g���A����eo-���f/��&�Y��7����|�kYg����������6��,�q����8�R��,�2�|�H/EB/��=��Ns���vk����t��q��)��P!�C������7�s����f��D�eE~���� ���G�^
1O�
�I3�"%�f�VXO�sp�%�;K�������-�:tv|.�$�b���}CO�2�M��'��
�x-��B��g�4P�6tv�e[izj����6Ya��;��'"�q*��o���_�P������k��(��h����~n��^	Tt\���1>&b����
m(8�����o"����E��G�Ky���������!Y��mw5U��t���'��n�	!���/aB����#��A}[ >e��JJ���
��m�&�Yl�z��n��;�
�H��
� r������I7��3S|�2O����+%(�/�����w{�B��u<�to�E����X���78p� u��~r	yCB,���L��|�����v��

���!�?�%,	�G���������v�����D=kG��&A��&ba$"��MQ��<Hm���F�_���S|�?��Hp���;7��S��8P������yh+�25fH��Q��dl<9�~��O�=<�|����B�|�<(Ui��~|%lE���
�ei�����b�@�q:��c$J�N���o��R�p��O���,%4��E�|6��!u��.P�VX}��,��3��L����
�������!�'��~�8rBq��D�e��),�z4��
�4%G�2�rH^g���O= j����W%2����D`�9���v���7��?i����[�������3��pT�r9O����}�
��Y|���41������2���I��^aj5b�Z���g�=�����)@�����Q���*=���_~N4�}x7��`�:r,�l������H�Q��`�;O9����5��E��ou�z�o%���}[m�����'f8U#H]$��	?��������
����E��69���-�}=_�<������Ndh����H:��[�O���6�NQ�|��=cR+��O�l��u��2\��o�E���6Y��En�oO��J��C�g�~��g���Pyb�����b�?����4�G,��I��R�����{�g��$R ��Y����w��gFB�������:��������	��#R�� �#�d�!�{�Q�6�������g����������}T�0L���{�W����MT�(UzA�*y�8`�4���R�nH��n#Z[��B���dz�
���=yd�4G�E&x���?���ex��d�*��'i�U���%�I�_��a&�-����r9��[��|r��EC�YL�B�2_�*��tz��C�2�KA>���U���-������%U�NCEU��x�����?�op��������i%w��@�B�G�l��H����Y=p���bi��n��&~�k����9�u��t������:�`�����0�-��&_��]u3Q��U�a�Bs���3��!�k���?��m(����F�hM`��������j���>ka��^�l�p���J��.[kV�k���)�j�`��B;;���t/��|�_\v+5���'\��*|���%������z>`�T�J{	t��+����+�4p���w�B�

��;����,���$����l�N��9����e���.�8�+L -�NUQ�t��_��YU��=��:=�
�E�?|��!	K���H.��������K������/Je��2�wb��*��b� �����N�g�b�z�%��N	4��S1��|U^��}��V�E��E�����_~��GqA�E��S��� �0�P���|>���cH����Qs��Q���W�!��%���8������a{`�+������uC��(�Z�O��'X����pc"����_E�����5���r�]����O���.E���aLE���3+(b�~�����1;�j#_k�x3�$���U����m�����N��u��������[�?�Ek1;�����.Z�On�y���3����au��������d���lC�0�A������OZO����2����^����=���J�y�I�!Y������Nb������{nz_���P���\n4-�"A�9���I�8�LUBZ ��O#��*�(��o� 
���:
���xV9�c���l��g]&���Oc��eJ��2eo�x4�	���%J)6���������hF-!���e��������������������@&������!P���I1�5g�zJ�R�U��!7������T��W�(����YcW��c�������r��G� �������v��}AX���o�th������7�^�\�I��G����Z���G�X>�������o�G\	VY���a�%����~~�8�&K�5u�_�������.%X��0d����Q����2���i��(����[]�dn%B�q4.�����FOo,���!���d�wx=�]��%���bP�����Ix/�%�#��s���K�^P���0�$_��L�����z}�m���q'���qth�������r/%o���	�����x��Xi_��yj=`���U�PN���/Y��1���Z��6M�k|������'e~�}����xe��������w�/l���/����*!���To���}�kE����E�p�0�"�-	A��N�	~=�(��*�z������>\��5I��[��%�$���BVl���F/>������|���-����d��/$���D���E�������\�q4�y��4�O�S&����k����"�t�r�85;�����^�v�~��{7��q��X����:_��+�Cq���9����T���D���<k�(��^�X���#�v�6`���\����?���}�_����*�dU$x
O��'>��E���e�B���^
�����[���D��X����`�>�	�����x����"�*
I��-��	}�W���l-����>�,�`U}H���y�Y�I�PB:��m;�+�� 2�n���*�����2����2O��D�&�U�hEJ�O���_�O��An��?��uT���J�-�5�����'���n�����!�6����;9w��[����u���j���[�{���h�����f�j���3�2����1�[����L4J�6�����<� R>�3��EC�a����/��~�
&�8��E�+5���9	,xw!��3����F�����w�C1'��)�c��^�Z3���g��w^P�?�EOa>���7��2n2*�I��G`L:G��T�Cy*��-�	����WU��p���	�PE������`��Xh�J�Nb q�p���N��/���UkJ�P����c�2����#�����!h�������.����_k��_Rl���E�
�Xb���o��I����[�q����r�*�}��L���?�s6]p�����$��#�� ��Hgs�r���f�Fg�/������A�2D�~�����W�U�;�����E��B��*4�h\o=	&�"Fv'����I���@�Z(�EUl�w0�;�x2��z�B��C]�����
��S�/�r�v�uQFt��.��-*��H���Ay������7����~����u����W~���A���a����������3>�g�35��~�w�#z8�	��	�	%�N�vh!2�C'B��@h5������o��X��%_"�d�Yqz|!������|�F���"Z�D�O�'T���7�RG��������r��S���������v���F@C,O��*�����t|s9%���	�D���Ca�a�ZL�l��_rTS�
������/1r����+/^$�-0T���`
L�k\��%�Nr������������]��j� ��}c���W����/���\�C�K���L�6�xI����;��3g�s	�$:�)I����VB�u�X7��E����4r�M��b�u�aV��a�����[��/jw�p��&�~���FA���\�z�'��������o��b����ABC%��5������Rg����@�lA���K��Lh��.����uz��+����d�!�~o����[=��������[e�r���?)O{hz�u1�g��5��7%�pfE���?��������%��qw��i��G�m����|��N��r=��?I?�n�^��eNw E�
�!`���ME�D���n&!��X���U�s�Sf�00I��2�B/WT���N���R��Y��J�{��>�?w����S��;�	e�E�]�HPn�L.O#i����T���|=z����rG�e�����.l����6�d�V���V_)6���|q�~n�>��o�.�����5���c�Q�Ln���^��W{�*���-o����8��������c��_��7�!�F���D1���������v"��P�10�>��@t$�cD ����O����G]�[M��9A�*?��h��(k����/��6�#����|W'�
l��
0�C�2~�s�E������:�NU85���y�/�>��������)��������p���'[����)�E����j?���ds�+�% ��2j�FY5yb������R�dl��w��6Z�G�c�}�^k����}.m}V���#��u�{;{]��.�IH�/�7��H����rjC�hZ�������&H�����n�������#Q���=����	UI��|�������~A����!�A��W�_(�%�$�����-�����(��@�T�� "Kx��(QvF���>�����o9�*����������Z,��`u��.���J�VF�Fr4���hYI����sB������6(m�8��"�%�h,cP��h]^��	�	�IQ���������8r	�;�xFR�����7������9�|��vP��)d�2L\c*?�;e��*uE?��}����k�gg1���	/�>�<�_����$b �
�c��("6~��a��Q�T�C-�������{�"�\���4����h�ah$������'�����X-���t��Mb��n�Et�u����������V�(�s��@��_LD�B�w�s��*�X\x��O�V�����n��Q�u���>�����	������P���_��'����:o��x�mpIG��
"��*O_�[or����[�o&,���H����[����2t~T���3�"K"T'cY1S�x����}��8�3~9�-�E�6��s��"2�_��^4q��w�E��1j��X`�}_r��5�L�y��������Yj�]n������A�����a���K��?�3�����}���c�>����+
Y���a�/^�.Q��E���+��q���x)����'���a�>!+�C���?.��K�{2�k*{����3R�[��Z`���{O�3Y���t��,�-���r���$��g�D's�o`�j�~K��N!*��O�M�sci�X�������C��f:/�
��/����rW�����$��w��&�g�!w�s�\���������c^��8Ma���.���~����"f����T��l����sO7i�-7���r�Y�R�������Z1-9~��)�?��-��h���s=�����K��sC��1��+��Mw�$J3����
?r8��j���@Q*2U�K4?���mi&kg`d�������BC��D%X���X>;$�CY|���Y�*XP����5�Us|�	6����%�J��J�7�� �BOb�����J�l�[���O�X�,^��z�!K�$�������H!�\��ifs��'��c�U,����u��zp���5����6q�d��yIp.�lZ�{�����=S���$	�>�r��e��q�~u�����qu��j;��tn�x�������]��EA�!&��^3��5�9�\���X���(TZE�V��{v;������>=�'�s��V)���+-`x��}-��K�t^�n�lg�fv^n���	�`zY���(�dH��<sI��O�7*�������q;7.�(7�P���a�V����=,�UD`����FW0>���wmN;�3yO���Q�E�����/K*�Q��	\�}s#>.��	�������Bz�����t�k�V�y�
������Z�$�.��[�F��g��zD�R*������YA��?���B�^�G�'�4�����9�D����_4�wE��T�H��������+�xL��F�t)�������������T#����&Jx-i�y��5���s�.��j2�Y��d�d��C)`���R�������:�+��6�_���K�B��W����������'����<���Z���`�>���.����o��y6�k������~�C5�)Z���������Mg��xn&44n��Wn6
�9�g��U������"^�@1��F,��DF���f����hj�M)>�w������7|U�X�v~��������Dt��J^Q�������}��\%�)���^��/a]A"��R
�#�I��M����B���-�.1!�H�w����-����'�����O�D��ji��8��E*�+
������!�Yg$�c�"q�V��4�\���������X<�{U��k��8'���)p�_�Q+
��(�I�C� +s%����d@������{�{��?��Q���s�1�R���W�����H��	��+g�������I?����a���i�~��*���B�^���F�I|j�j��?U1z�����a�qM�0f��u����	,k;���,n�19�z��F����'N��8�)~{�g�����w?����t=�
C��6��eB8�kmtT�~/�:	���W�Eyn�x��n��������A?g�r��8��z��e�{�LC��!�-��cnH�,���������Y�!�o����}f�����$�sw��Gf`���C�{URn�?�3�1C�m�O����m_�N�%��S�-H8!}����
�_�U	��N�?d<�eEy�16%��J�l���T��<U����m��S���$XR��"["jIs0#������B��tg6���2�[�����cP�,��_
p�aB���z�j����_��x�
��u�	�i����?��@Mmf�f�o0����x�r��Y��_K.�M�����*q�B���?�0
B��;������^���F����t����5S���f�����=#):���t������bI�����S�q�]�������@��a����&������L0�����A��ji�����{IKHy����������������� ���\ft���3������S�N?�BN���N����k{�����3�hg��:���2�����]�h}u������]�i��Z��u�@��p���?���Z�DHh�?�n{���Wo[AA�"A7�tr�	�������O�Ep����t�.W.������O,O@oJ,,��������vr���g-�u������h�V�6F\�
CQ�sp&���eCw��r������qHmU��)���
6�A����)aI�.�"�z&��%���y�c,��W$?x�m!l#�9F���Q|,uF
a���
EHc���Y�I�E{���*�O�
�k��l�|�Dc��is����F|����yo�ZBO	>5
���\�o��vo��	>�fi0�B�p�!/�
P��!��I_6F�����?���E��jB����^������K���X~V�q�rGX��$�RyM�$o<�Y�|B���T���5�����m�f�0$�b�$V������b�<@�'���C�S���T�{v���o��������\��m��aLC1���{�
~���h$F�1l���|���c������W���g!�v��9�L
����^R��O������g�w@]Y�����m����!��
�����n������o�:�-��N���X���������x�
��fv�+��)�b���$Mm>+D����3���3X�g��+�-c�3��*�jnf6��M��oq|-��D���H$���G;1{$���b���",t)���OD��n~���x#�m)���/���I���!z:�������E�_���a��d=4t0�����rDd!$��`��T�mW������m��y�!y����Wn�>����Y�OE��Q�G������V���cK����KN_F�[Z0��5�b���$O�c#�*���4I����>���l�����cgrc�	V*@^$��J�'�c��{���/�wu5?��Q=��6\��mIJ�GO�WG�������!��N�����V��@�.O��Z��_��
dW�4�M�������=Ov����-����ru��K�Ya���
���-'k�����6���� ����4�;)��TW)�5��GDr�8�[���G0�:���,��
�|�+�\����{���4�9���2��.�0t�'�(I�%)tr�c�m����,���p��v
�����-���&�H9�p����8"[���E�Sg��0������^d�0v��B��\�XfR�
��B���zy�;ZS��	i+��V<&0��������fp��(~�!��Y��3���x�s/#]�i�@���f�d%vn���A��"�%F*q��'����#�GdF7�~B��rg�h�����+�+���8����_����������omJ��
2zB�1��&:��HC��r�v��0w�;��=u�������f�k��D�qA��N��/{��Jy*bh�P��,e�O���9�w�����]�"�sa����`R�l��@�G�d��QBm��������V��#U{�<��nV�%N���L�/���������E�%�',��=�!���S���b�x6���5S�*)�S������z���Y���|��t��%�?�%��o �w��cc���*5J����5���Bz��D�9��n����&mj�a�����	�a'o���o�.��+A������4V���(�N���rKl~.�6�8�<�w��<P��@�u�L��:��QZU��Q�C�3z�]m,��s���E�r�a�Yi@�����5:H�Y>�=�������?*6p,���_��>K���E���Q���[K7��SAU�a5\�~����Y�z�����H�������U���7w���\iL%��&z���H]�f��(�;*_"��I�i�7EE]�NL�
2��$
L��a>��@"xT�	Da`�����c ���R4��s��%0�����E����W'�o�w�1
��{(y���Ep��hZ�O�XL$�c*'��:9>�t>�<;��Nv�W�7b;�M�
GJS��
����B���?����p���,!�r��"������Gao�4i[�Jw�oqhiV{Y�;��%j���m��^���h��Lr\
��-'N�.�[*��	+�b*�����"���w|c����wv?;�R�����<���"(=���<�e&=�.0h!����`?7�47! �� �k6��C���;Ad��q<��^�P����x	,nH��P9,dz�e�UT*.cY` r�A��KHS��W���2�y\z�5��a����\�b������R���k�_��i�U�
��*����{�{d)D����Mm3�zZ��-�|u���������$l�I�|���JL��yR�����G�����3D�e��U����)XL�21����Gw%���:9r����W{^	"8��|*&��G���iJ�:�4Mg�%�Oh�b��vB^�2���HuX�S��FD ��jV��Ef��+2���Z���s���-T�x�%�vBg'�"�����)��N��0����*:�I��[���R�1��������u5�1�^��#!���F��.8[���| ��ZOe��;vm�.��=���I�_��
�[������q1I�{��tnJ���D;��c�.�%7�
SOQT��	��S-}'��'9��X��O+�28
,�8*WZ����������?�F�������U��c�������-e�Of<�94�p�N`Y���9�0[��n5Ola������`�B2��������8�o^�FJP������F��F.X�M�kV�����������J!'1��W���!�xN���ag7�(�d��]����9'x��y�.�vB�8R;���9X!VWEn?�b%���_�4�=���9�����)b��P�k'�{�\.2&�M�C5Z�\Z-l���(�8�zp���p�BTy�����W�!!�@k���8�����bK�R�SD��sBW�~�}�6r�{�a���G���+����X����:�����y����b&���s���8~�h�mJ��(Z�-�p��1���F�Z����\U���G0�RHD�U������#3�����L_����^�h9���?�4=�)�����vJ�g��F7��C�g����VS���	����+���I(���p��3�����4pfg�?�v�M678�����>S|��s
:)��-��?<���L�dS���GE�x� ��Z���"�D�H;��GRD��
�X�����A%�P�����g�F�!�VZkw�6�?n�G_+���)���;:�Qa'�.�gu?�T�S���z������:i���I���&��6�\;����M�`�e9���	u�N/�H��PW�=�)����*.V�2�7)V!�u��ZoL�OeH���"�B7�B�"�l�"KkZ��R
�h�m�����Y�a�������G�y��s����M}c�����}�2�v���S�X����3����V�v���w�9�
��^���M}�^�3�������S[U/M!VC
��oNH��|Z�c�Yl�5{���������[������k}�.�I/������ ��n6�/���=�{�?�Ke��O'_���A���V��:�� n�'�A����ka ��:���������'U@�m��
����.k�+���������7lAX��77�'^�`L��[������� o�������E���p�*�m�;���@����::��>\@��k�v��=/�7gc{�E���+��f���e�����&0�7O]�8�����'���!AS5+$�tP�`�u�a���	*3�	��^g��6-��v��q/i�d�1Jb��l#�'%R����7��|��������������)����mY/p�_W�/k��v0�A�4�`�b;�a�A�r�$���V�D��N�Ja���~U#�8�@�����������-�������q#�q�d�	�y�K|���$Fk�O�:��d��e�;��`��=������k���O���J*��%�3Z_l����M��e�V��Y6����]m�OW���I)l�4-���d�k3���ttj�V�9�5W�^&�u�U��v�w��q	{�6�<�O��3>��_w����w�5kG{���7���O�+UaQK77I.�(Iq���q��nEe<J������|�i%��X,��/Z��vj��x\��[.(k3JJM��k��@^�����������Pq�����3��e!��s�Z�i�}���2�X�u'�Iy�
	�Ld\}���+�]-yr�QX���*��F��).#
BnTp�����B���b���Z�1[0��h���}#���z=�p�\�����s-���a$�TWz��W9�"Q�mu�c���zb�tk���*���z�E�(�E��e�@9��l���O����El�dB�a�J?&�
�L,Yh�Y��e�r��������57��]F+-����X�<O�����pj������M=K�[����
Z9�G�xGwr*#�^S��f�&3o���/{]Mk��#��"������L���g"S�O;4��.���K��h��������*�:K���P�seM����|�����
����B��]U��o�Z�l/=8��)'�*�������8��(��������w����Q�V����@���V5=��~_�vNZ,��X���/��|�8�9�7������i��9����BO�������uG
\�V�c�p�)��k^f�&<c��	�������������}�����<:�>Is���p����4	U��a�8���]�M�RUQ�*���H���54�Y������/�n�YY����s$*��M-@U���1�������3H��19%v��1�@6�R��-tP��
�vS�RnA\f,sd��F�U���g��+8����3~��(r�
�`����F,�+�6�	���Pd�r���i[�~wB�0�]
Z�Uzx�j��r���� r��oD?�*��6�.B��ors�����������8P�YayJE���*.��r�F�4/�4���bua�Yn���pD����!�D�nW�.^��A���������XS��b�����E�m��\���"2�J�cC'�6����v����|[�G��3���m��{��%{��i��9�=����������5+��$uO�3�,kCuZ/>�i`*�E������/��7��Js����4n�=��s^��[����k7):��i�~dZt/�'��wD���b^]r�qt��[Ju��R��{TyUG}5�j�����fm;��$����������O�
����yE�8����2��rQ	�LabWa�b�����d������`[�JW�
�j�}����C�l����dU(5����>@!�!W7���L2�4o�n6S����M�'z�I���V����,���	����gM�����(���R�Td�]�[�Z
<`]�����Q6W���������F�\]����:������j���:siN�L������"�
�$�h;n��.[��[vs�J�c^�Z�,P���k�9������f1���!�Z>����9 �'�x
���`��8�Z����^(�O]d���M�Ql&z��c����
�m���_.,A.��k������S�?��^~���q�O���v�R/Ti��Lj,�X���Z$+��EQ`zV��T���������
P�r�x�W��P,��������ey�E���Z�!H���s[����u[�h1������R�]u���ti�}�����j����	�Z1�H����}�_`��V|�_7�����Zo��y��������J^+��7%��~�'%��t��yp5�M
���+��P���4/\R�!2�0Nn������f�o���@��`i������Cs����6,�8�)��!P���k��|��n�dc�tI��3�*-W�����R���V�Phz	3}���J���Y���	aH�`T��m�M�k�����W����[0���/�\�:!}��-�`�5,�������6���J2b�����|�fgpj&�6q:�;��#/SW���(�h#�(u��i�.n)�'��H�w".�ZUf�;j�X��6�"����s8�<)n��Q-��w�Gm��
���~��X�U�����=,��v��fC(���S��[>v��]j������a�9�g��:�&�SR
"W1��)<���[&�����x16E��!W����TsI�	R����2�?���C��C��'-�1&q��'��T��F��E��gY��>��2W�	���v�z�*�V/
�M��((�S�1>>#�-?�n~���L�;���[g,t�s��	����f@�d��=��m���V���%���a��#��:�}���o9�� -bG*x:ky����^�B,�T�U[���w�.fVeV�'������ND_����+!e��v�:$k"r[�4l���[i�����Y�.�(k��L
m�B"�Gv�[rk��UG2{��[�'j���KA9�n��7`)����������P��!a��i|��+S����
�H�=���s�z� ��P"?,���{��������N�Et�B��a�M��e�3Z�%��Z
�7�%�y&D�(l
K=���Ux�U�SR�LTyQ��=�R�����u��d�^1��*/��B��T��TYZ�Zh��Jm�V�5�<D�Z������
q��g�P.�/�8����`-6��'��9�W��4	��\��4����z�����-/�/�-����/�]�]$��Lj���H�|K�p���u:� v�F�q��p� */,��&�eX\@�-�G��\��z�(Q�'��0�b�\k}���(e��)@#�-��PK�]d��G�kUWoh�_P5����WV�,��h�kY���gZ!�����{M�E�������mg���Am�q���3r���i0�5o	��4�K�,P���m:kqN_�1RI��d$�eo7�ui���z���.w~�)����d���Z���O�?I���Cf��	/��������	��%r��-s���z�����"����^	�0��������f��9���5!�ui�ui��4�+{I�t=�����	W�@�X
��})��Ls��N��6lm0����x���O��]�D��	V�������0��J@EC%cr��ee��0�'�>�n�	��������k�����|�h=��I+Oag��[�E�9�Zq����)l5��%+���4���4b��������R�k5�rX:���r�
=�w��/.Q7T��Z�(�7�2�"��������u}e��a���e:���|�"�=���H�d��1m�#�)���
���^]_�X����Uf�"�r$png2O���X��j�i��������k����fQ��Z��v�Y����CD���y�����K����"���)Qf!��i�#/`�b�RV�O��x+���c:�n���dL�F�F�&�����k��1y l*��.��-,|�f�S����d	%5Nrh�D������;��/��B����G�4����'Q��U�*����/z�9�K�
�Ogx
w?�.(*�Y�AQ�+�_Q}Ds�<�����=�xG3�����A�5o����-������x�B�;�sq�h�7�u����8wU��3
�-����Bn��(��o-��v�]�6T*�d�+'6��p��VdJ��YA��e��9���AW���u����I�.��7k�T�")����{O�]g��n����#R�l	�`����c�L/]"ve��:.�	�}/�zN��,cw�������`�Y5�:B��j��7-*�w�x%����*�����S�;�2�[�_�ge�E��"e��=��
.D��v��S��Y_N
'9V_�������dAOR����|^.�
xnU4�����
���?�V��\�����QzSs>��*�kd�j��ad��F��TT>a������s���7��p�0����'��[��:M+4Jn�K �+ 5��p�����Z����.�po��@~u���H��V����U�H|A���#��I4%��.o�gJ�d
r/�'{���i��"S3��C�C�s��!��bF+Ck���{�����(����m2���'_�u2:A�������x�M=?o�s�i������(����/}y���\���"�|�Q
��>�F/����@�F������Y{��v�����,��aL����Q��2�vI9yE���jtj�:�'���l����K���kT�����&��!>Y\�Af���y�RR,Q���8��s����%0JA5��b/����sG?+��c��\d�x}KF�B��I�M������{+T�e�b��o3�D ��,j�j��T�e�T��������������B���EU�����=��_��fg�3�HUt+>��x��fa�������m�d��h�!����&���^��=��u�jh��#����1��k�Y@��cF��=����
j�n
�}�=�}u���olI"�o����v�������~jh�h�.o���P`9\�)��H4i��w=g���U�0w�iK�[v��J7lO�����)�%_���v+����[=��9���,���m�_�:f�Y��wn�?I��;����0���v�Y>�����v���bU�j�v�������n�my^,�FH�{������|���������yj�k9���~��Ltn'`�:�Bj�.��7IMH��l)�}���������fr����b���J��� �:WC�+O|��������������le��k���i�l���������3���qPX&"��B�iQ,���"e�������-�>}v�u�(8����1
.:�6V���
_�����G@z,-�|��c��kF�Cm#,�����>�N%�d3
T�#�n�u�R2�N���<��v��$�-r��B��>�������������'���*����HU������L��$~k!Aw9�z��|��t��`�@/��V�[p
8|x{�K^Vi�8�p���G3Q��j"��W�����?���nd`��5�Q+�v���x���fy�������_I����]���2��2
�m���NH,��'%�xh^8c!���v(.6������"2���;c���r������H	�Cv������L��.�/��6�:I���
-��	-OOJiC[V@C�2=�
�S��xNei��*��_����h�g(�$���B�,�L H��3��Q~V�{O�������4���rR������q��P�m������u��*���c�_	a�����z���Y�
�k�6"��;'V:�*]p��������p��/5�|��~E����t�z�w:~:o1�&��0eM�nxb�u���k�JI� Lb��������z�7[�[�8`��r1�D�)}6����:��~�����V�����667'2��E��8ED����&e�H`I����x��jhY��23�J�l������4��EG
����[�H��Rm���F�'8���J#J����RNr��:9�����&�\mQfV	���q�mq{\�y��	��-�Q��|3���^��|�"s��t�UQ��	�f
�;z]V<1��b��i�E4�0���T�i��{��j4�~?��L�]�I���S����k��T�9�Z���q��hzh}z(\�+���W#�OZ YG1_��a�[Eb*�Z6w�X�:Fn�p����a���&�*�������}�����V�n�^>l.��D9�nSh�~�+��GN&���D>���anj�\�j���Qw���3�j��D�������1D=������\��7(4pek�0��]��`}���+�b�-���i��O�h�\����������FFs�lvk��ZV��-�pw�������tx]'�Q�����r�B������`s3x��b���T�@NH����w!���To��[?v����=�����b����R������O���������|��?��>�`�����4����|QNs�g�
��hKQ�7��J��D������/��?�U[e2�l��hk��/z�<P��=��%���u�^N=���H���_��!����{���^��XV�b��Q�}�������Gt6J�v�������VR���k]�c&�j�������7�n�.��G����%g-��^8���0�c,�5�K��%��X�L�	i�o��bd�	�x�>$�g+�I�8�����l�vG��%T�8����2������aF��6��-.g�s�I���K��EN��i�^�c!l��������TqV��de{y���+����&��>���'^1X��HB����	���0B����,���R�S�q���]�c�<+9Q,�W�h'��5gw�W@�[xjnQ��0:���\����tp���v4��mU���9x��a�@��$�ESQz��7���ZN�f�"kDH`���E���������R���x�S���������g�E�&��G��g`������!�S�����(+�V���F�;l��%^�o�yENJ=,��&2��������6J����n�+��7[y��K)-C"�7���v����a'���X7@c�}A�j��e�=n��o��a��%�$�XY���]c�/�m$����bu���A!��Y���P�nEY���4.\������;l�<MFotq�\XS�j��^��O�W�M}#m^�r���'���:���M��K����N���N�J�w�:9UD�������N�kM��GD������?/&�ZWY�Y���[���]"4�*|����NU�z� �����I��]������g���bF%p
�G-6K,�$��	���B���!��Y�ie�XeZ�j��C��{05�U��l*,}C���<��[�
��=NjUM�qW�
s���}��J*���wUE�[���|9�?����W�/��f��LG��.�"�n��d�����D�m��em����H"���!���S�}�o�� �|�Tp�wM��Q�M�nr�������<����7�.����n�N@KJ{M ������u�
�%��G �k�Go�����2�E�:�����s������+��>x��Z�V�S�{?~����ON �*��E:��6�����n�������1����o����Qk2��~����8�Y~��\1�%3�,xeV���<�����@g�����o\�M�)of���h�!�O���C�c���]�h�������z�%y���}��9�HE)��dE���lu�K��t�7D�u����t�i�s���x�G���t�Y��I��}=���2�����.\�B�@]�D�����z|������;���=;����reN���k�/����@2m�K@���X��C��@d���a��@f���th����kQD���R�aC�X�X'm-������E���&�t���������������������3����t��K���V12����|v~%4�l2�6�=��S����]��	��}������D��{sqC���1k��-Vl�[��������x�����3yPw�u�SZ�qc+�J�3��.:E��	���;���-k
]��o��ku�}��_�|D���{��xM7{��(����
K��0t���y4ZZ��V������;����y:�d=}j�0g8������qP�}��(!��I��js�8c��;2�o��)�lxb�V\B]3���}e����otu��2Y`�����M��
�=jj�p�P�g����;�;o=�dL���6����_W�����?sV��N�)���X�7{�l��m�#����1��=�L'hk�����@�^R�Ie'��2���
���|��w�������`r��
hD�M&�Xm�i|Q*���#�%Q�����xGB��~m��~���{�h3D<�<���i�*�v����'��f� T9�{�d�93�A=�]	y��t%�X
��vT�[fL��t��}��u
�K�C=v���T"<v�F�`��p������<���%���C���8]�e��l��D��v�R���"��{Zh1��#e��#_*��G)��M���� ��/a? ����5oN��RG��5�����I���$!�~�����/ �[���M	�8�����}�z�5C��
"4��}u">��o��,������������;z�lGl�c�bl���A[����V���Sy�)Q���$�J��m
��W��;�p����=��P�6���}��]�Y��G?��F����'�g���C	�6�n����*�v$�����n~��
����N|0|m���D&�h��������~��P0��Z��L��[��9����
y)����4kwE�-�b�t�po��#�gS���V)J���`^J2�#��6���%r�U��
������I?��#>�C�Q+|D��Q�.p\���P���n�3T���X)Dx��������������
TRm?$���9-�_�I�9o���n�i���J�R�������*���z=���( I�:n�������x7�U�_����+�r�-�(*���R<��p�l����H�*��'hN��`��_��c��������)F,�
�;o�����' p���:�I5V5k�
cg0:������&����IM�d�v�$�VU7<Y�3�+����G�3)����9�!k�m����:�A�#��|0�K�}[l�W��t��pL����s��������=��8�/��^�FK�������dD��;�5/;��G�k������n��h����9��EN!��C���.�_��@����mV>Iq�����^��l*�����^�e!�����Y���x�:�>yfm�!)
�g>oM��#��j<�
|DZ��t5�k$���
���������v(����J����+�(�������a�_VB-
V�N�0W���_)I$?���J���L~S�q*�G����[��dNMN���[�����,fPL�#�EM�v{Y�5�����.V>S�H�v�V��o,��n���y��3�-���[�q���q�d���c�V����+�<#pIS%�!0�&�[|,�,�����l+���dJq��?�^�M%�!��J���&�9|��S-O*��Z���I[Nk��4��y����jX���=��	�z�0������U]U�<v�)���|���Te�U����u Z��C9m�_��ogQ��6�������,
"-X]*��������t��8�s���������a����g4?�R���k��Q�f��CN?j�2��e?�u[k���b��'D�����q����J�wq`��e�G&�dG\�jpc�^�zf9
��j5�l��"����Me}e��[�x�-j��������JA����y�*R��z[`��Y{m'�H��.(����q|�7�	���Lft����i�P�tb�i�E	PZE�� ��,�����2�{p"^�N��T��H�-pm�^�5�-"�s�N�5������;��{|=���B�������a�%���-��{i���D<Y�A��~�eg�.�_���
F��������e�fd��'����:S�.�1����Ys�*��0�l������N�c]����-�e����!��W��?l���-h�b6>[[�%��s��?�-�R�z���Z��
�������c.��{v�`{��r0���q�k��[+�%qp���Q�����R�'c��T)���0�����S�'���L��vO�(z)i��]�P��J6L0��Iv���8�Z��]�������<m��ZO�Pd��<�������UJ�}��g��<g��|<$?��2�@�+��lw���*H���l�y`��ZA.�gx���$��d��>�
Ax+��82� �F�-%�$;�(=�n�$�>F�o�8�.q�%�R�W@Zd���Cy����}�8�O����me{�"��X�cy�������9l�x�9��<���A�����b
�K������L���l����R(xtn���1t�:��7�������A=K>-��� ���$S�������^iJ5i��Wo�T��	q�U L�U��rO��j1�&h�\�n�$����
(%�9��Ry;��z�:�d�j����Ad��4��4�SJp�����D��.��|$���51W����C�>U
�J^e���t���v�o1�`u:*o��L�C3it�#����N��]U�--	su���4R�u��3}���ky)A�@�|q���+��M� &5H�()nFt�s�����;�Ol���rh������Ba��T�4g#��<V
�G��v�HaFEW��yNP��}G��
����G���k(�K	jM���H��!���������X����_%�� ��d����j�v�d���u�zS������)��k��<�O���Q�i�����yk9Z�)@��_������!h�[����.�3-�C��4�����!aRK��^Y����K���G�hqc�Ww���A)q'��$��t�7"9��g�n�z&�-+���J���A�}#���j���<eG����,���%W��i$IG���P�F'��� z�f����|�/j5];G��/J������s�������ln��X6��v�u���
��v������h��N���T�^�f����c�����d^�D5��L;�����c
��Njj*H���wr�Nk���D��HM)��}z��=r+� �\4��|
O�-��U��A\H����i+B��B;��*���:m��������}���`=��v�����tR9�/�/%p���H����.p�l��7�K]2] ���EJ������p_N���9pZ2��G�f��G����4�Y�w�|����c�]�Tk�6d��i%��u,U�`����������c+*9�2k-?�E�]��������0��N�R�Vn��Vz�R������������������4}��A�X�c\y��s�����iB���z���z�YR���Z����o�����k��r}����j����0ob/)�z1���b��)�	V��F)A0��T2��w��_�(q�����^i�_$�k`;�Q���}��������
�����
UI{R����/��>��A�[>&pxXg�F����������ev����aV�E��t��t������A�eD�pi�x�������|��v�_�:!��UC^��.
h��cvIr������p:���5{�=Or�T����V<j�t�Db%�W�)YS��vj�V�>/��#�����`6�c��#����PJ���b�+�f���C��6�6����M��o�_�JG�N/�:}�I�!M�~$;i�l�n��H]jiQ��JM}�8=E����^�J�cE����!�b���
�]x���	3��L/�����!L����^�kE�3�;`��G��,0�<L�����{/KW2�*�\���e�q�|}�M���D��~��^{��j�n�yP��d���|MW-�cT�MZ�>.��N���
����q�`�\����^%,��908o���2�W���?������Bd���d�(~4b�]�i$��w�9�*�r(��c]��<�
�A�Uy�Y��/v'_�6����	(q[`-�a�O�QaZ
p�QM����\M��>�Y���$��n����	���R�fDcwK�-���d��s�@R���+�f�pKQ0r�s�3X����5x����u�[�`�2�g�������YK��Hq�����bR[fM�����������\@����d42���X�fV��8�u��&�{u�	�&�r
��3
����o�K0
u�w�'��`��a�`A�`S�=0�s!s�v��#,>�j)�F(��~����4�iI3,���f��a��WSM`���7�Y�*Tezc+���`���,���J�xIBCb���"-N8l��%NX/���|�\���\��.��pT_�8���+����X���&<K��4�
�;g!n�X���2Xl*`mL���vJN�-~�-.px�lhy3!�1���8�[M���D���y��14:b����PQ-v��I)�wc�]�o��nr��7y���7���24�R�T�d�`K����e��
[,f;*�;����JL-BE�pob�KD3�����Lp�g:.�]����:uLs�&�RH����d�������d�m*I�q�m�b����wD�b�|�S�����h->XO����|J���c�F��7X����^��>����$�%�����?���L�%��V�����(��^H������e��q�r�I���-"��a9{���l���Un�t;(�rq-�c])Sv�
�z������INk��}"�G�%�r��3f���/�I���������LE3���7�D����d�Kt%��t ]����!ig����K����<!��T�i�����x/�US+�d�C�D�!�
J[��1�������@8�Ty��� E�%BDQ�w�����Q&�\kA�����
(&{*YXT4m��y��s��6znZ���%�14��A,��3���H��\I���:rPQH�a�	��g��W������U������5M����[mO[^;]�z��Gp�����7A;��7*��E�Mr�Lh-u+yih�d�����D�N^_����@��G�D����@�`��O����_�L�7�5Km�/�����*Is�k��j�>�DT�����v�n������">JH�6�i����E���A�_�Y|X����*r�	lz��<��
����x+�L5g��^��o�k:��{�	�.��lw�'.J���%����#����BJ	�S�m{�!59{H���Zz�?%�-	���7ee���/b���,���������nK�}�R�5�e��6E��;M��d�%uW=�\5���h�
D���>s�v���C{�d�qG��F��	�t�V���d��B[S��ZQ���&��z�
3�:Z!eB��
����vIn~��w>}MwP[gE��F�J@ih��
��i��;(<��i���\�I�w���P�J#9�m�Q,i��bA�H�
h�*O���b<��l������+[��F����Z��z��jn�����i���b�3�B�}��g���SHI�|x������mB2^������&�&�>���k)]�+��*@��V�u��7��������]��������U����a����0��0[^ ����g��.V�7l3�{���@I���������y�eLL|,r-�A&�*��tCS#tWHR-]�!1#�kn2Iw�a6`!�����Y��o(Z�);�z�g�`�����m]�
����r��h_Q�v��0Y}i�p�����k��F�@���b����8�H�b��H����G�V���sK����z��k��4jJ�).�d}�I��X��d��"�� [�|��|�����V���&Hx&��z3��W���i7���"��N���+i]w������L}���`����4���UJe���3;���V�������u�<����`g�~�Y:��F��o�	�M�����#�P�F9��b��k���E%�	c�;��+lb����LI��3Q)���'G�����������B�<����4a�������ld"������C������B������kG�R=��">�p���lu��r8%O9��
A���T\s]"S��<����d-�Vz���H���al�~��e��r	j~a�����,%T�O}[��	-����LB'������7��� ������{���,�K�	�Z$E�DA�J���V�7E�+��>�7e�b���P�%��$���T5������
Z��_h\��l��g��Z�������nk$��O���4�,���TOT��J,o��^��B�k���:A�`?���{o���S@9�v�7���d�����T�~A�����sq�bsw�B�>�)��~c{��
P#R�F�����%���E�;=gU% �WO��#�G�-���by���Gr�E�pQu"��a��������Z��B�'~L6���(��~�����(+Y��@�X�
��d�������(��pK�����F9����.;�o�(��v�:"7�YoZ�~���P��Q��K)�-Av<���f[�}�^�R����o$���mib������C�sP�]�^�x�:�qGt~8����d�<%�`�`��� ���j��P�H�4�s��>��o��V����n8��#S3���r�����������^G
���r�����u^�$|S�|^+K���W�+��/�����6�S�P}��P�K��t���B�����%�����!n��(��L�kt�s�,B<�� ���V�q��!����y�����]�Z��m����+0K5��;5�)�M��+���L(�[�����1$�v}���D]�������8f�@�$�����\X�����M�n�$D�SD���k~,�7y�b���������b�h&O�_�{'?!�!v<���JRL��������
i����L�>
6�7m*C�"��l�@�����On��HaQNc����V���g�In��T'j�&�Z&W\��]��#W�]��)��zK
]9�6����s<x7����Q�r��LM�Y/f�	����T�zQ�)�^������'�~�����)�7?;��3;�am����)v/������2o�%>�'���]�t�k�v��'YIX\�����C*Y�/�!�! `! j�?Xp��C@@<�����ciji��jc������6ZcA@<�`�c�������g+=S{�����2��
,,�/*�/g��  \�! �A�L��������������W�/ZH���x��\'���6t��O���oF_A@|$��x��UJ���\-������9����A@Y[�;��9�Z[��g�w�sp���g����!���@��R����A���R���%�O�������������������`mma�geh�����<��NGO_�������~L����dda��`����<�����)U��bcgmlgh�����s�(=���T��p���4�~��n2L`��Y��;����J"�4�7������XiC	!%�(@'.-��/��~��#�� �C�om�`h�{!���aT�?C�������������������������������=�66��';+��OF��L��L,L��Y��X�cbfef� f������q{5;bb������������$$�_���b��QR����T��	38Y�[�����bche`��h�l�6v�z�&�����6���lT�?�����_e�E_�>�/Dq����g3�6k�O����'Ea����7���������fff���������600077���www���
���MMM���UUU���]]]CCC������GGG���OOO�;�@��u��/�?
(�,�~u��y�!uN0M�b����Qex&���BJ5}D�RI�G��9�����3�t��'�x	����S_�/=��'��u�a=��0z������FR����xf5���l*��G�x�L�d��(��x=->��Oy����b�\�^�IEO��^���s��fAQ+�F��m6�H�V$�y-���Z���v7_@[��/�|VM�\��X
��g��$nEr���y�1vpu�<a��_�1���09!GY�������3�&�_@E}Xwj��+[��������~�>�.�!N�mS?������'n�Dy'����a����j���a��~�O{H��6���+[A�/��
�2��2`����v\��QF�~~���esYb�-+����HR�!�B�l<~���(>�O)Q7����+BMz��!G]������Y��,2�&��N��t��:l]�O��ER]�6J��
z[�A����g{�3���P��J+����S������-/�/���mU4!��[oU
k���y-���
{����M�����!����cB+Q8�[M`�.�-�����.���A������
�<C�qu��Wq�'�F���#o^�����=��~gZS>
�6���8�rx�5���i��{/�z��7S���N�WE".���e���hY%��Xh�bx<�E+���N
G1S�����$r\������2�Z�KN)��������*Q��?:8����N������;�
�����!�Z�����1���\�=2��������N����%	�`s�U�1�Y��DH�_;�����a��|����2�=6�>p	I�o�fK�&��>����4�����	��'���n�Z�}��+���SO������/��}x�s-�ASl5w
��ELR�;}��^I��i��0l�/�c%�k8�s�y�&c)P���E!��5���G!�,9��N���Y�J��:H�u1S$���2}/�vO�.��3�B�����!���+B��:knS�dD|�x��./?�5����5*�MGU���������L��=�7���{_�W�2����{�������"^����O_"�d���s�x��$�^(��b�� =�f�0�;
���DfX�O�Q�@A�'w3���E��������~t�7��#���m�+#�c���m���"K�K�b�Q�������A��'�
��g�a�E�6`��D����/I��.�w6m�
�M?���g@�{x[�b]��w�m���a�H�b�E��F�cnJ����v
��A|��{��XS�z�
��KXS���yN��b"g-4�Iq���%Z`@����Jf$7��f�����m/�`�:���\*%�;��nRW--yz������	oZ6Q21��;������5<�V�W?��L�4f����JmX�G�A�Q#�j��	����=�0%��2�AhDVMI���F��JR��M�G������D Qg�I�.�R1�%�VU��K�&�[�H������S?$0��A�K�/����~��b�H\4�L�x[��`bJn��x(�Va����2�b��C��	�~�f(�!������U|����������tx��4�sG=O�M3M?fA(�5k�)����f�nlMK�}����9,B�b����KD�mf��Q�����<�x��/)�jD������5����W�v�n~,)�e���W�M��y�%[V^r�}��������T��s-j�RQ�5X���� ��i����7�iNv$M�csY��T�" �i�
W�����-�k���^������*��qo���>�T�K���]�-+��#��������s��5�o� <to��*QS4�A���=Z?��Ds������#�j�~�����V&��l"��
�Y����hG���M�v�����F�3:�}E�:0����Xur
{7�<��To�Utr�����/��To��~�����R�sj�Jt��[�R���R��A�~��Pq����Ft�I#���=�j�+�q�nL/a�����x��������IN���v��{�"����U8��0K�Ht��'�.]�~W�(e�BM�]����#wM��`#��Hg����h�rk���(����v��K����\�F�	�n^\���4�B�O���4��^n��,�q��wm�h��d
TrS��
+�W���JiA�B�u�"�#%�Y�adr�=*��B����L��`}uez��X����E����Z:s!Z+�6�6q�����`����*r����G�����<T��E�/pl�]T�8{^8�&:�F���R������e�G{���t����PMuS-��b�"*���Y�������S/��/������s��5�	1:���J��U��z�w���P�4QY����zu���V��Q(4~����_�������g�s-��c�\�L�� �}��Jt�7�{�*�W��$,������7Tx|RW�S<DC*�A7z����_6���TE8��?/m�n�\p��/��@_&?��H�RM�!����9��8t�e\9�0�Mq�����n����$���[)�v�>��:,�K--
!�b�����L��.7})�7:k:����3dR�6�l��b�7v�c��\1x'<���3�
�|~x�,#ERZ��8������VJ-�0�)U�����N{������.����x(����.�N����Es����G2��P�.��$k����e�����G��*JO?���k�W/���R�����X|[��ih����,����z������=�q7�#�X-U��a��<�D�UVs����)��6E�����6�������RA]����������������������'�N�
���-}��}Ko�,l��h��Vn�
`�Ew&���`+���!��ff&&���PT��l?�I@�����;h3�yS�a�d/�����(������������shh+��P��
EPK;U$�8y:}��!�.�#i$�;�"�x>�hP�T��, t�Q)2�)���L���JbY��/�^?��P%?88`�9���tp�>��*AF��1\���X�)���,�R;���E�|�����P�r'���GJ)`�M��u�8?� T"@~�I���{��j����~X�/���c�����r����/,y�WWmb���2X�V�L_7��B�7��"GM��C5��31�
Sp�|c��d�\������o�������W���7*SA��(�WHyB�Q*�w����
��:/[��X�����]���g�0���Uid�W�j���i(��b��d����	���rA�%�<����}�����n/b��C,�P���|	��B9��1��cVt��<A���L�U,���v�#�]�A��I���w�j��\3v�� ���LSv����[��{��
u�J3���]����������v�4��K������J�!��<$���MD�=&
��^��^K}#(��������e���nu�7�v!f���_����-2�Mr�So�O@�'\�r���rC�����F"F�7��pra]���������1[M��"�!��@�h�m��&_�~�A�K��K��"XHI����eASL��w
��c52����N# in'Oa��G�����x.I��z^
�vJ�[Z:�[�eC�5Yo�z�(�S�c�6�o��o{}7�t�<�GGU�K� T��@1/Wc�?��%BX��&����
h�� oll�a�~hG��A ��N$�!!{%�e���������`���R�����jQ�a>�p�6 I��PE�I��������:Z������W��R1��R���y��W�$nJ����m���<X�OA9@"��$��6@V6�1��M�p���y��'�����qq[4��C�n�E.�}�k�;�6tw�t�D�g�;��7�.Sia��lK���:.���X��U�H�`3"7hM/��o�2��il&�)�`��Xu�1*�}:.X���N��"�yXG����e� ��`�E��� �]-�K\�W1��L�{nmZ��x����pWT���D`������Q42�^�2�|D�&B���.�x�J��������!5��+%�F�����Lx������D�8���oCmv�Zk�M�s�e�,�f�Xp�+9BN=�I�a^��~V���B�j1��9*?�tX4���h���D�B[��	����;[��4���]����um~���tX/�������{����%x�
c��������H��V�m�9���N<����7�)N��+��]Ed_���(n�Y��]�`��R~��:k�n�P0�[ys^�L��*U%��rt������l8��r43Y�(H�z�������5�����)
B'��}�~���m����������GE����)I���������c[��w�;�5���HH���r{�c������H�i���1��W�W��-��pR�m�����.P��N�r ����x�������������i�C�s$d�<�SA-RZ^�
[�Z�x�������g]M,��������z���	T��R�G����>$�#�u�@,,o����}=�6n��2-^�8nQ�����O��s���KuQ9]_����{�|��U��������j.M�X��Q�W&�'�����!;4��P4����������V��	���O#gg��|#��l�#���G#�>k���bj����O�Z��:j5��3�}���P�]o�������(��Z���G�2+j�cRP��(�5����%{.t+�B���a�I��<��zE��/������=4d����s9��������iB#E��p�@�����S�����J�Ubb
�"����
�����:�
�c��okM����������:3�}�2W6d��^��(v`�F��I�S�L3t��!�r��R�T�����7b�~�lY�m�d��x��R�8�k2F#�(6s���||C(1�����5�
)�u{�������)�����/�����"�
{��|}��������_R��S��l���\��;Rr�M��7���RTk,R@�e��X��=�2QA\�����-74�T1U������5EP�k@B�Q��D<gbLC��p��*W�5��t=�7l[Oha�!K5�O���j;�B� ����nu9�`hcf%���-Z=���1��=]�bw�J*�7��ZG�GP��I��l~�:�vVx��'Ow���<��V��XX���������k��R��
�e&>;����	-�;V�����)�.�����`��[�����l������s�4�<%�L�;H��{�������_�3�&�����p�@Q�j���)�m�M�I3��k{���<Du����X���2M|���)#��}	���_�|������4��}P��"[#7�d0���u!=:��](U�g���.����S����3�gJ�q'��di�\�gDi����c\�'�F���1(r����pe����,�X��W����E:�*�1���"}�.�,����-],/���s?�z�������|�(��o�w�?Wx�<%���P�u�����Dn�j#*��[�&�5�R����pce��6�n�A);�B��%�6E��4������+�F��r�t�7;~�>N#�t���(���DB���$����.5��07V��Z�qWd4)P�P�
��k����$�7�~��F�DF�(��A�I)��b�.���o����+1��!wBjh�S��tE^���0�bX
���c�\P��*��&F:�����F{�]Z�����!��p�4����Z���������S�c���������<=Ww�c�OqQ��W�W\B��#�8����GTqv�-�����w�|�DZI����}#O�^�'��M	�|���p�k"*@�w����W�E�ej���yx���>������p���%��`�������#z�9z*�������n�KC(V�����L6,B�[)�����c��o��sZ��zZ��T�������-/�����Zd�B\^*�[�����B[��0�l��ST�����3�n�|�c'��t��"w�>�q�Yb$r���B�,���~	��������k�g��=<�Reg��#nC�F�Z"~�7"��+v�}���x�7]�&&����7����[�0����.�6���
�9d,K�o�U���{����]-�'�SV�h�v/$����c�=`��"7Q���&�! <��j|���?/��y���9��'���4�@r���m��;s1CYs�z���Nh6�L�2-0_�_��j�_�R��z��j0���N����$g����t��r�+�Ha�#�tn������n�2����J����{JWu���I������b�x�+��g�3�=_��oo�O�������f��e�?�$��������i�X��7�����a���Y�_o�_������e����������[�1������}`����l0Kog
T�f8Oz���,�S��>�Y*��Q�|���ZY����O�nP��f�����!w�ka���(����/7������S'!���b�*��%��G���c��V��:�0\�����9�g�"�p��zM���U�Q�Z�V�%��8��C8��A?�Fi���9�����>��I�,�S�����`��m����cM�U�x�c}7����U�8����2�M����'�Q�FY��G�#��wQ���DJ�e��������=���_o(Q�4�bDY�EY�\8\����{	Q��5!�"B��i�M������d�D��a���g�+�� ,]�PK��H�����v<�v2���c+*� .���q��x$�r$��/�����{���s!������DwM^�V1��,�n���N�l-�R|�<��Q/����r�5�zQ.+��/3���@_��S�I��6��{�
�-������}W��7m(M�-��[�r5��z��'$�x�yV\(EXe�o"�
*+�>���rr�xy&t"(���c=�����������/|�H�.�$�5���(,��:eh�������VM1>�q�z�4�*0.e#(,G�;��a���*^,�ry]��R���t
vi�����a����������n�J�tS��h}�����J�:[��>�e?y?N���T�N��N�Z��v��V������� P��������p��\���z���(���r{|t��r�wl*=Y������4?���N������WW��nc���n���
�P���K(�������d���m6��t��F>�u��������f�����*����i;Qg+��Lt3/Ox�;����x��;;z=5yB�����P��Z��{�q|�vt�UyB41���p{\s7��z�������y�'�o���w��mkl}�����;�$�^m�k��=���$��x��u�K�</6�KA��Q~�$)`����,Y���*}��9o��R���=5���x�~g�5���}�����xx�|p�����������Y���"���''\~�
���hIw'#{�y{<�:�����d�V��������'�EG�[��l���8"Z�#���Y}��.:S���z��q��5{�=t��0�|����\
��������I_~<������A�>#tK�'z|U��j���� �����q����t7Q��>-������n~;=�g���I����lm��oS�-������y�>��u��������4�4-��r��v�{��=�X?���������v��#�r����}�����5Lx����u��T����v�s��� {�J�x�u�$*s����#�]��i9����z�ts�����9a�m>�(�\�5=�Y�")�[�.����7�V�Eu�#����d>��$����^�eFW�t|�f)���TV7#����Aie��^�'�,A����'�L���6��a����2^�����5��c���F�=���#�����v2��zst4��i���F��0���g��������R�h�w5(�����
w��l�IgF���x��t�3d}��
���]]>�=^�������x����B��r���v�����a4M�����_w���A���
��L��'P������uO�u��LxwO��8]��x|��e�hq�+��Ty�1�A��?'���&D�.�L��IHN��9i+lJE��q��X..Q;{>�$�z��������&�C���6u���/b�6a.��D����{bw#W@�ue��M���k)^#�A�M=�m����M�g=�n���q�b�N�x��_\���L�C6���z^����<�D�?V!<�����rv��Z/�����4��.��ou����<'�+�G��O''�-/7�����Wo��,�[~|�*n;��"Jo)�_m��
�^l��~��m�W$��	���_�����V@���W��/���}�o/��y]=���A�Z����V����k��P��;�q%�I��|W��_�w��fea�ii��j)�o�����}�����0�i�ez����j��=���y��_���^�.���z����V�Lu���4�]3&k��B&��s^wW�6��@]���
dl6w|[F��'
DB��'BU��7wG*'��Z�u���V��"�&6UN��o�]e��������No���%]�^;<�Q��l�9��������O�K��l;�O~

w��w�.��`Z�\�����
��2�-r{'�8�D������W>�EB�_����v��!�d������v7B�?!��n5�>�w�p�����M='�Y��x���9��"";�Z� M:��J=�hfr��"U��"{����Q�'l�-o5��9r�����e�Z!#q� ;&�:zQ����>�D�t
�Ik�i2o�?�bM
6R�)#�=���6��kZraI�>0�AS���k�_��#��J����[����v��U�t�H�uc'�H����$�p��\�����,<���r����x�U������r����=�U-��D%���A��,�qKF�#��e�v�N���w"�n�?$�!�Q���A�y��-x����VdA�kW2��"�X����4p�NQq�FYj���L{�����������NU�+4����O}���,E;v�.?`����z�����;�>��`v��,�s�'�:�0�kt�b3!.����/"�������]��S�EF~vKp��ZL5q�E���"Lx@���-�Z���/[��c�������lbF�kA����H�5"���'@������G`��8��f40�U>��D�����FC9���%�f���LpI5���JF������c��+���M����R�Pmp�R����<_�A.@f��N��$}�Cb���Z\���K��I��o�^�?Y��5~\L�n����������CN�%��
��;��0�H��yX��"B@TEGt����^L�E��VW���-X�����58�����Ww�+(O��P"j�y+���}��N�������#	P�D\P�K�!B�U5�>�U(��oq���:
������=%�Jh{1���������?R�'����,�7���MO~�u��J���8��{��	S�)h����0V*�HH��~��������YOe�t�	]�P��F��X��
���E|����4��7R{Zm�=K������ ��B!���L�VB�-��wj���t�����-�c�t��W4���>�
�
�Of/v�����m�D���o3n�_!i���5>�`�DM?�[Fc�?=�B�����s����F��X����l���	��Cs-���O^_�����k�m��v[�X+����x�p�.$�c�X��@�=t@��?Px����Je6�����$��bFe��1�>5�qsOD���KP��u=�`CX�j�z�NX�'fC������X�pH�0�#�����n�;8�5�\:Q��<_g�~�)�X%��o��YL����_E4c��)��L����������g����l
yv%VF*������Ses��/��U-����{#�o&�������������Y{V�Sb0_�Z$x�Z��K�!Ei�m���x�B�`�3�9��������g-	�5���.�C|�,
Y~$�+I�=�D
���B5G��M]���C��8S���J��Nq`$t�k��6#@C��i/�
����:E6��
n��a��1f]�d�G�z0��t��i�M*��R��������Jo��du&�=���@��
��@G��|�`2�Q5���}*
&�	�/�L��V/X���g2�+�%�'@5�h�d����9	-WM�]�0���������a�3�c�����~6V�����O!>��|�����A�bF8Z �xtrJ�`��da��������gh��0�^~�m�1���2���_�8��g1'^�EE�_����������k�����L�gY�����H�k��qj�l��e�w����l+�������5
����s���,B���h�� ��Pd����8Fw�'��|����`�b������p����s�-kb~���H���l-�)*a�N���"Il��7f�j7�~ �|G���o�P��|B�����+n��
?��G^�T	.�p���M��G�7��k^]6�F}���F�����i��S�#���~^<�Hg�l�]\����A���SM��?�wE��j����v������p��4f��G)l_��}��g�6��Q��a�,.��3
�������/C�������g���hu�/r�xoM""�m3�qV��>c�����\s�N(�x�9�Nv�%g['Y:������?��t�� �����3������^v��;P��	T��0�q"�A�!~Sb�_��������5_���}��8�
��q`���%���p	H��|�W�q|����C���KQ/%O=�d*���d�l������L!P�O�C�/����]����H�G���3���=��
��l{�>�5aVy�����7q?(�f�3�	%��l4Z�$��H5N�O��������]�����FT���)u��o��<Vu�TqL(,IZ?�v�������as�7L����2$:SV���_!�g3��_7���+B�#��$����o�8��5�Q����?�F�(�~�nA�����������`H~'k��3B7����v���:�@A�x�*����C�>������������a��&�F�����P�B���u �k*�kMm@2�2�G^+�������^�J�Lj������E��NM99�=RKw���u&A�Z���5�-l�������fY~������B�������]s�|Lp��H�=����H�����@���MGQ��7"���L��3)�m�O�Qnp��tE�a��d������>�K��j����J!T�r���p����*���r�,��|�\1{}��Q���0���u#,^A������Ohl��z���xZ�_�1�������'��z�b.�a;L$�H~p4v�N)��[�d$��wl�"1	�BU���J��;m)\��e��iI�b{e��x����G��)�X���U����S��G`���7L�����+�I�?6��:��&��8��[h7B
���&OHHV���uiP3�%�;�Om�|����
���2�������s�e�j��/c���U��{��W�w�3�����?U�?Z`�ytv6*J=%�	�h[������W�1&xH>�p"	��y��W��$vd#�/6}Y���k��0���1�.�6���jC\'#Y-.��������d�"����J��Y�j�j�	�j=i���mWy��v���U��h4��U�tf;P���-c����o���]�ZkR����-��t��(��J�Z���	�eL{Fq����������+>�<��P����G������z�E�^�s���P�����p��/{0�s��D��
{�j�����H����hL�I�B��2�/�<�:��6���Lfp����\��y��u�|�*�S���@�b)�����E!c��B�oR$��{vDw�Bo)������SYt�2���8�+l�(B�%q��"
1c��B��[x�!G�*�$j���k<��l�����o�9�UP�W@����3��(B������mh��z���(����4���e���4w�L�����J�g ��&�'��@{7�G���$Q�.�b��u�dtZ���� C����*��m]S{
���=N��'7��b�-���q5-9{R����������4b�]��/��iK�h���\n�x���g`Y
}��D�K�o��}���GM��]�;����]���j������Q;�QU��
_�����&3��GJ@
�O��S�E���"�"Z� �__d��qBNE�^|���t���m� =J��}��9R&�z����"��]`�w\a��r]�Z��������_�8�r����_�L�\���X��ji=F����^�Y���d3�������;k��A�����l��/��`�V=h�Y�z�{:���c�yd����h��34����O��	wH�F8�� ���v�W�J��t�j��6�
��G�;��L�S�>@��P`�mG�5w	m�����E�p�
�c�)�����J>�v� �Z��+�������y$*"�~C�p��[N�������>�0Ix��c�o��m;<vE�Jc����������VM����u��#��@���	��W��\nJrx=��l]l[��X`��_�kM�V��*f�C��?{aT��OI���,@�����]k��D�n���$r%������&	�(��|�]39��2�Qn�_L'��#pa%�
��j������Tz^4�!�X��6{�\-����H�����	����G�C��Y���N�<��_���?��2,z��FL[����/��G�$$�W&6�m����K�KM+���1�^�z9T�&�v������NG�������~��?�>�����Es�aQm;��?��l���������|�k�3�������j�M���1����c��0D��*�x�"��[mx��f�W�5���Z�Y�.�������6^~����������72�T�)����������r�!�Om���5`@���9�MO�V�_��.�n\�*��p��,���~r�.Dl6������������e�l>�+>����	H�-�v�)��*�/-P.�4���u��KU�R��������G�*n����)�� �G
?�9)����I�.�Yy��*��+�M_q��}���Az��&�8}XY�K�e~v�`U�����[}���3E� ����J���K�L��#��5B�w�)�>�l��N������\Z���U�����,�r��1;�����3�j��QN��
�+\�F�"a��H���Hc�{�`�k%<��,��������`2��o|��b�<�xSq���F���<������^?��v�������g:��������I��N�Og��=���]j*��JhZ�&����yg2s�@���.�����k.]e�I�;�=���4D����G�����������b����B�����E^�XN(F���z��
I��p�[��<0~!��P�U[��5M�����g���4�(V�����*:p�l���w#��.��W'O�c{V=v����E\��~&	���k��i����*8�����'"Q?{5I{�+0��O���G��Ww�n����#_-�:�� �&
��>,����qL���TI��V���=�7���<��UH�����To��ki�,Onis��YV��ls����3�*`r�{�FV���X��������+4���IV�c!�k1��o��J}�osI�[P����$�"Y�V5f�f�[D:/���"�s�u���[T��
�_��_7�P�w�x��N<7~?g�ak�5�+��v^%��O���JX/"�L�������u��\{5���UR(��L��z��D��.�7C NE>e������Hoy�uQu�p.7�Y�L���g����������/��0M|��
tG1��Cbe{8����������j=a -6
{V�+�J���=���T�q��z,��a�.�(5U��@��JI�$��`u����f�����u�Mg���A+�'U0$%��n�9�:c4��9F���v�na�h=����]�]���*Bq�J���f|�[�wU��e��#����C� =P�\=���M�[��F�S�s�m�%��g�`��U}]�/q�7�������b�GiB�X��Rw�w
���NN����;��/�71�&g����R��k�,��->�R��w�S�\/fy����Z��)4�L�r�
t����vuK5mTHMO.�3��?V��|Di����e���j������
3V���oDR���u���XD�9�0�����4���Sv�9_�>����������K���k��%|��*+�����U�j�J�\�urx=YW��f:CvW8�'����h5],�j��k�-���d����|�W�g����_d���rp�q?��f��v��Ovz���;�4��m�+��:� ��&�D��$"���b�����r�C�8V
�4Q��iT����1
��&��������N[+����?��Am�:�k�_�ko��-�6�3M����8IW��j��M�F�q�t�
f���l=>I�����M�9�	u���y�!�T:��RN
����%,
���Q��m���db�. ���zp�*u	����[�Qr��������0�2��B&���Kr
�~����Nz�9��	7�����)�iw�����9[�y�	*��vb���l�C��}�))9�2��sOLJ�a����	x�����(�,��'����T��x��4�XbM(f�['�^8cdP��h�=T��\��:U���:�f�XW����7��y�-;���$X��.�P�`���~I�N����a��"�bJ�1��ZZ?Q�=����@+#H.]}��db�\\��T�,����O����T������7(�}p�����S��	���^��_,���i�.�c�8n�_	n0J��{�����p�N���vWm������|vi��e�W����R�K��0��'�	L�r����=;������^4�Z��@�\�<=x-�j)���1;0?�)��y���I	��^V�N���ecs�i�,�!������F�������W��;��w^��q6������b�gM�L�fT����N��E�����l��Q".1E��g��$d��
Zn_`�j�������1C�3�����j�]w7
#�����L��b����\����&XZ��v��~�5�C������%��r�k�7n��0����&%c]�FQk�Y��X |��'�����2h�N���^mX���������#��kJ�������Q�D�x����>[
4��i�?v�`�QNK��s���
�(Q��Ky��!_�R�5�*�Wy<�'h�>b�f7b�h�_1!��OM����f�]������b�8V�w�|��)�
�KZ�WOHv�)��+��o�w��Ont;�9�,gT�c_U���s����������	�2,�'p�,T(��.��w����
�����:bk�?�O7���k6v�������s���WNq�&���5M������j.���w��>a\��d��Io��'Hd�.��u(m�3apmTP3����_���r���}\�F�<N�72,�W��'�����A_1� ��V��4���Mw�q���R�52��b]
"���Q������P�H�Hh�A����5���7�Q	M������3%��t�p��n�n)�t��Gh���*,c6��5Z=������C~{��N`���x����t��m�^W+�hA��c�&j'�F�����N57�n�����)��^�i��g�2��c�u�	�.���\�����i�q����uK�p7���Hc�k,6etM�|�F�vkL��o���x�����7��:�z��M����=�s�[��5Y7U0i�=n����U��d�`�ibL�r���!*�+���m��3d��-��)���E$Z��uT�LQ��$����.�.�K�A���7����v\�j���	k���b��$%�sIe���t� ����n��b��BB���5x\��k�m@�'��?�_W26U��bhY����r�����p}�������Y(���%Q>�P�f_wm��N/�Gr�@KtB���An4����4)����L���M���[?9[��z�	Ls�����|�{�s2l/���.����L_���a�lL_�o����������������n���e5������M�/���J��*�8�=^��0d���t�W@��B^��6���E����k����`4��l�.�����F�W�h�j$��)��]�G��91%A�8	���=yO�sqt��'q�m������\@�GR�8�>�EX.�����1���<i�U���Il���������(9L�t���.K�=��[d����a��� (�����gS�6��T:���k	)�::�R�"�@�aG��F-q��<[�Bln�i}��
��)�v���P�b`��FO:uLp/���5rN��[�;�������O*�<��>K4;;
��V�fcL�P��=a�L;8P�a����s���m�m[-���}��A��<�fd���:�����s�:u�p�R�_�Mb�9cT�Z�S0��y�������s��A���-%k�3�v�t�g��uqMzd������$+������?��Ay�����b�Vn
��:"J�y|�	#� �x��I(W���t�����Y*:���F��k�#���p����A�����Ss�����8��	�^��R�L�����?�����M�/���������8����21 ����Z����!��r�
7�����;2�����1PnAaC����A����!(�F�1��P2�`�6���&��rk��'.����Ki!��6����	�����j2��d���phDT��iP��D���T��� �GS�dtIrQ�DhG1!�Ah�����@��i�x
�i)���}������,�l�-�)��[�H��L:��
��t~�~`�����4�B��$��E��O�t�nW
�-�p��XX�
��C81gcK��&\\����������9���������8Ch�`gd���O;��zJo5�[N>"F�P��^S��K"����,�W\�������zg�X>oA	HhrE�n�N��N�&������h������m����s9��Yb,RK6Hd�o���P����V��eg��m�SXD"��������+���t
����=��m��n��;���v��f����cw}�����{��-�yR�'�W?�Ui�hXI�{mD���W{+�	�I(}���)��=���I�!�)�G|v�~���P���k��kx�G��,�rO�j�2I����q�|W�BC\
W��IO��)����X��?�~�������QB8���T����q�~�������0<���m)>��F���lyG�,���%�,z?@������S{��&�Fe��w#�����V��A>�� �3��D�?�b}�����@�P��]�E��
�H�b��L]v&������=�
;�������cxd�z0��],��dq�Dp��HP���q��=2���&=����7���c�h�l�9#]�W�y���L���.w�q�s�5���`�J����'�)0�y1A��������{J,�����z*�^}��Z���|�c�L�t���>����)F�xqK�Qw9{P�,2U����	[�&�{�d��Y��u5�Z�E��}��;W������J�;�v�������{��H�D� �+��DW�������v��W#�,4?�t������6�K6>�%vj���7�0��v�s|���W>{��7]���'��~��2���-\_��1�}��|��=��j�D�b^gL��f2�|�3c >Z�c���(4<���_�p�CCoL�j�L�Z���Hh �S[�_>�W�����TR&�?�@�]w�Ul�]���x5�z��MfV�	4Kg�/�_D(�L~�+M3/�R.���Zy��x���^�Ty�g��Fj�5f ��OI;*,O�E����2*�H`[�1:��}6�@n�`���}}s�����������N]��]���|��u��!�\�j������j�����x���12d�+��nq�)�1����,���t#�(i�n�?��
�uG]�&�@*��2RS��i���K=�6}�7�yc�����L�H�<�#B�j��k�<;������J�X}.���ge�fv��xB�]%z��7��o��_�o�{����1�f�`�Q�n�� kvr�2Z��X����aIL=��fU����5
~	 �q��.�B�B(g�;�`z��+�+���G���4�����|��h����m
J��R���t�Ay4�������(�U�;]��ao1
s�>nY�l�k��8n�/���"����B�����Fo.�����L���")�%����'O�G(5�������V1�u�Q��;_��g�h�=�y���Z��yu�Y+�>������ty?p�ou�7���x��x�^���;A6-�H�	�kp���;��������o�������X�(�{��Q���j��$��L:���=�,0��������j����\������e"� 0�+�������Z'�'O�������2�����v�����s�p'{�������fv���v��|�HA#ys&!��7'��8T2��������
��(�t~�<z;��-W��|��F�G��)�Ci�S,�,�:�R�D�!���BIg�;_��D!����K��9,����`��G����U[������A<d-K����[��vi3�*M��V��<��
��!z��<v!�a�>��������D��.#+$����J�����\w���d,�6+f���g�*C��Y�q����b��|&
�Q-K��~O���:��JqT���u�
iE������?�-�����P�S��.�;��:���������H��$��� >�&~�Bb��\nl��i��Z��
=�j����$A���OuFr��v���)��W�Z�#�1�������)��c��l��}���x��w�������J�#����C,�#��-����w���BX
�q?�������vi�0<����Zf�~��`���o��xv�J��b�J��)EF�?�\_�k����)��x6�(y��e4'k�s��@2UMv������v=5���2�F�j�e]'U�+s=���!Q�������W����I�'Q)JT9�|v����0K��[���$��������X;������i��f�7�CM��*$�V�D��$�����Fke�CF�b���)t����'�G�j���	��n�CR����&~�@���;�r����	������z�5�_��^�&�Be����Sl�K�������������<��I"�%���D!1e� AR�� io:��?�xn|MN����<���"��^[�O�S���6R$���
��9�;6*���k����w>�3��3c�aK��o�}q���0�D�4���gf�Fr�;�Np��RtNdhg-<���\UXM����\0U��:�zG�^�Q��T]H�X�N5B[�Y����=��
~���M�.e,��M��8�j������!���p$5�D�������>����K�a}�PS�S��8�_��(c�������Y�=��g!vE����>OiH�[~�3j��)�?pJ�W P<"	��3a��?���c}�N���Sq���U'exj�x�;	�p`�?�1#���{�[���6�+�/p�#��N���(F��%�t�/�5V���Bs�(�����
����_���=�vlH�!lFZ�{ 2Cq d���L�1�n4y�&��������wj�t6�z2v�b����d���5�H�e����7.�8(���������m�x�Jq�G�����p�����������%����(CA��p�z��N}���9_F���/���k�@����>���b��s]6���1j[_�Pj=n�}$��2�]�s/���C#��`��0W�������u���/(�L(�S�uI�n���o��M��;��r������9��\�����'������W�0}58[~�=���g��prVC�5A~���Z�~r����k������>��9���v�������Q���x8t��EnT���y��P���%qIwx�1D������{K�F�
�)�<1�gCn��8���DQ-6���#DB`k]j�p8���T����II*��A�'�@d��C�������X.=C�H�s������m��e~��*/���:�=�c�;���K��~3�)�RF���e��$?�7!]�����If�E�7��x�������[����a3#��H�n�`����g�!��gQ&�4�!�F����/���t�
��<�����{��9�M���\D����W�����F+I�,�N����A�75���,�me����+�kw�,�Iy���5���1�f�+�VX
�K���(��g��Yy�R���`0f�%H�^x�g�AR
���$;�O�1n�L�U?�N�[�C�8�q��
��hU0�]�i�]�Ct�>F$$���0��E.��&�����(��
�\i������7��7��l��������HN�P��?��T������'��c�QO���z�M�7=Q��_A��U��\%�Zf��pY�W)'A���������a��d7EK?6@�;���
��6���W X)�Rmy�he���������>���S�x)���:	Jv������y8jJX�y�����#����H���a�BW������3�;���]���5����53�1�"s#h����������/����������	�������D��fB#{��=v���N�}����\�������4����8��3������/5��)����[�|`G>��lO���W�����g��M�*�V3ge�������Is��f@_�Ayy�� �\���Y|��J�|r���h�4�8�ci-����{�4{����$$�De��O�d}g���2	K��u��0�o]ph�����/
�_p%��w����Y�_��:&����:�������>��kW��f���?������E��	�E��7��9z��^)W��#��]�����*��|�`�������w�s���m3�BB�X*<���>%��:�$ ��\/�����k�_��X����f�&w3:W]���~����m�_��i����.q6x_�J�E�����XU�����U�.2��Y��	��W�g�}����h
����4!z�{��m���si�	r�2/�k���~��#+����K�%d���_�{���7�	dY�&�$���<�D��?��6�i�~���7�����Y�����Dj�6�3pb�7,�V|��z�v|�^oC�5X���X�*�"+$8���Y�G�T.��.9Ma4�:uG?W/d%��������v���yK�h��8]�uX
��
b�t��=�6�
uz�b�������
�������c]�V����n�y=�)"
�4h&ZW�=�tm�By5)����8�a+.�����1��P}�������M�S����6��u!���������.��G'k�dL��M���N�7������dL�@w�q�i�!u��K��Z�l��5�������6EL6z����#���C�k�)���Z/i�2/`��]I��|bb�&���*�����y��}��G�������Tm0������#�_o�M��gI�&anlQ2���Qc��0�������s����HHr;��Kg����L|<'	a��N�V�������!���4�����N�2W.�=���9D/��* p��%	����-�%h�!�9{��jw]�@>��q�2��q}�,��.�����)+t]Px6���y:��:��`����"g��\�����h�0?��)�~��qi��6"Z��8����gUp�f�YsY�+�b����&^u�"�O�h���q����z��|���I���d�n�!��Q�q�2�Q���q4�K�"'1��c�N<�5B3�lI��[T(����;
�NYu�n�G����
5G-�F��?A���s.������8��,�OI����%�/�����mbN�j{�C��)|��)j���WL�o�/��3��6,�h�Z>���\xw��6g����.�7��s�n��6�L����|h��:�����D�P�w2��h07�p��koa3�o�A- ���B�k&l��6��"�x��	��+6�����PV7�
�7p��:�y�V����=���.��.�H���6{B0S���r��w1,����C�#u���q;�nU6� c��Y�v���_1�&�-RP�o�����\��s�_J�H\a��<>�m��[�H����u����[��0BWZ�l�����Id0c2t�H��	���E�o�������6��g��?^	�9M{G�,k�mY�M�����Cb������g��u��������cn����_JC&?2nIn��M��	\���GU�O�iH��/L����7��U����H9��5�j�R�d^�k�(r���f7�����e�W`��ZeF���dIY�</�g���[%��tf���u������8���;Fd��D��6���}��6�J����������������3���F&c��Q�{_Z�k�YI�kQ������! a��b��#K��A�E��s�:��}�kU}S��I�����(��{9�����$]i�q��2��C����c�m���#Z&6<]��Z�G_��=�"�\�I���e�n�����6t�|�����)�>�]�j�����X�~��I�, Z�NA�#�+��fF�R��	�r��v�
�j��t�PH����-�1�{����������c|F�M,Z����1��&S��JE�]����&��Ic����K����>X8�tU\���_�#�E�P�<IT��}����1x�3�F
��'�������S�uU��V:�4�{j����>��<��4�	U��q�LNDbW����Z���mw�_p��������,��T{XP��5�7M��>V���|i��m�+�g�����0&)VBtS'����ST�rL��u�Xm���jeH�T�c�!pK��d)��F����r.���H�	$����h�T�������z}{�8��?�V|��^aV����*'��%pO����bLK:�R`r�M���P�M�D�C����l��|�W����N>�����{��W�y��_�}��l��h)�`�D*�����s���2�	�92._l��U�.J����IV���/��!�wlD%�d�wI����V��`��t�c�m3;q��Z�3YC���p1�U�u������&��oZ9�=�!�7�N�������������rp��J��N��X����
�g�X)�;�KY�
�b�n�QZ������hIRD/1�;;����m��	Z��7%p��Qe��p��2&U�ow�y��RF��|I�LL�S	}o*T5�~����<��*v]�����H��j>-	�~j�y
x�z�}?��Pc_.�v���Q<'}�a~���K���*�wW=9w����@����~9\�8�~�B��cfE2���e��H�3����YU~UH�����'�W�:�+-���������~��U!x��3��N�[?b�i�r0��lde0Qcu�k
d0������b.j�p~���q6�m�<\M��m������f���(�'�~��x��{Rg�rx�0�6�'I6����i���6�P�l������(��@|qp�aZ��?�i���)��fqz�$���vl�P��$��w�"apU�T����}t���}���<^�_���FO�����Oz�����t5����2$�{���8�,UpqD?��o6�(T�y�����^ u�(����e	"t�5y_���
?;Y��<����|L��=aD@V&ui���8��)�9���"������705���j�	E����\�y����}X�>i���� ��)���	D��B����+c�	L*��9���7'P�F0�~�F�;BpBz��B>H��Vv���O��^B��� {v69�+S|�����H�v9��h�'\U�i���m��8�������sK��')/'66q�NJ�Z�_S���������g���R�u}hp	>k���7*�Miw[r��m�e��`K�!�IK�H&� ���TC����������Z�h�^b����f�����j��h^��������=�6]���>
����\z�J��6�$���:����i��f��!�oJ����I����#��D���3ugW[X�q�}�A5��Z�O����}���;�jf:�_�I[R�D���~1����lv�/��ky���P��S��:_,(ug���E������U����5p�0��x�?�r���]��,(M����vf���1�.W0��G���D���}�=�8��|W�-�QM�2�w�e�dY�Ki,��a����j3�<+"��e�t���jL)�3V��Lv�v
=y��Y�?OVc,���M.0y����(��Zv���8@������@�:u���_`�Ya�^*�?���L���\��h�G�����!TX�D�(�����3^���i�G�E���b
��eX�a�����'�\��f�N������J��i���I�{�����ZEQ.�F���7�x����>|�i�Tp+k����;kzD#�W�Xlx����
�2��T3H��.1,��vR	���2���Lp���������|N����~��0����h��M��D4S2�bln�;�z�>�E�:06�fY�����
�r�!��I��j��^z��<]v	��I�R$c j�b�7���8��G^�M������Bg�'���X�H�}6�������,�})��Xf"����w�?�ax�;F-��S�\���
3�F+mp���-��W�p�&z�?c�[�`d�=�:���!�o���������->��	���A��������AB|d��cJ
*&��s�[��������E���t,=!��}�1����Z_i���0�t�/P/ey������F����gm`���MDz���Om� �������l�����z.x����K'XM�������X��?�1��Fr����0�I�w�H}�i��B���zB�
��&T��,���]�O��B��m�m�P�(��Un�����&��N�B�4FM���I-;��U�
�t��a������Q�K���w��B�v��#��Z�J�l�6&
��������a�y�����I`P������*
���f��{��f������O�{% �wo���BXi�b���I�fS!Vq��2Zwd��j���I%����~��^��m^;��8���\�v��'�%����f�$����s�����k��NZ�����##�o��������(nm����Z���Z%��*�
��$����it�b�p�n#-�j��V5�%���-�{����k)j�h'8z�� ����ZtUH�w�%�)m5�i�]����E1,�WV�Vwaw�3Z).�Z
,�w���1�j�kt���������>�VT�����{��QT�X#��������21��{�U����������1���6���V?{!Vy5����b\�
j�*�����x|)�J����t��X�S�K��_%U������a��k�^����6�%��V�+f-^0A����<�*;��.v�0:G���90�x��O}��~m:G ���Zz�A^t_��@8N
@�s���>����� <�S�J�^	�b��KV�0�.�9��lG24,<h��`rN0��0tU�����}���H"i'����lj0���\6f%Z��~���1aRa.�"+�a��� ��P�������\�>��c0;6����q`8�6��Y����"�v�O�z
L}S��
c��GE���]���$���/��
�7.����j<GS����#�v�:������x��}��9-`�&��H�{�����Y���?@��1�#���H����L#����1\��#����U�Z����3�}��������\��5T��sM|����}#�&:���c#�����I�>�a7�{��Xf��������v�����O�3����q�d��|�y:�Ww�\%������1���n��e~��I(}���Dm���q�ne�G�_:��?h���.������m�s�S����,�`��Y���*��P�#l����(Z
}������,�}�*�
������������o2y�b��!p���y]�����W4���3��J�UM�!|��M1Q������FX���0|�t��q���W<M�We�1�b������x\G��7~����<����s��i�-.����kX�
;m
���Q$�S	�j�iO����s
v�����ih��'�"&\Y�N,�s*�~#��,�p��I/����8���r����D	����:�2��z��Q������E�/(���a5�[��8�C�%-�X�J������o^�{ ��l��jK`c��M�Y�[�mE������OW��G!�7{N�$��
�Kv�4:e���=�7���7m������Sp2��-����I�jA@R�c�~-
S#1�9�-�wq�I���^����3_����v���pa�#��I�.zs����������)�5h�rW��X~���5��H�]�7N�������l�����m'����<����q�����sr�������ANb���������d����xQ���f5��-m��+���K"���R-�m?Jym����gOq�m:�J�\&$*yPKYw�
C�p&S�R1�
F����F�]���	Q��?�S<����@��x��FN[���^���S��?��)S���q[���������t�
�~+|���_O��1t%d���@��!����_��g��G�����6%r�A��r��e���{_ ����5��@P\�6�h��b��1O��H��J�R�����*��=mX�/-���|��|������X1Z=������{J�����#�ld�!p%��\��-������"=���T�m�9&�!�/����K��]�*��g�*���0?�����k�>%�}�Q�sB8�Q�.�yo'�E���uF+������>�������(8�<A1���{ �R�*	��<������|
����Y�5�RN_����i�=��|��$ ���$'��\z}��G��}��d[�>��[�-���U�n��]��ox�K9����s3]����SW~���C�b�x�A"�$�%�@X�{������wA�Ge�e���P���Wo�SP������
�0�d.�������
�t�2{69��9dR�9c4He�����n����\�E�J`��%���z@,��
��/-�:��k��'��be������~aH�cN�v*K���@P=��C?��+��O��X4����u9�8%�5{�}5�s�~OD�\OP�{�z\�g�se��\.�a��f�W�b�8������G'�7c������o����\���?�%�R�����8�i�OFBS�|N�M>�{3.5��%�O�J��
����~i"��>{����������P��3&������v0�����e�v,�U��� h�`���
c_�C��Qx���
�_���R�������A�{I�r��� |����s.n�������z�5�{����23g�M�K�6�1�����,.Z��F"��6���%���n�����/��'>Pc��k��C����%�R�)R���S��+MnnP��������.�~@�z�U�]���]��b����{j��+$����#En�s#E\�r&x��OMB�J��6��i�F[��{��0*��bQ@F��c��H*��T{�M��nZ�AY�}�!y�L�NM�qRH/�y��5�}A��]���������}fL�m8��S���{hX��o��
y����K�O���'��PZ;��L�����J��?������� �Dj,�S#������l�#oy��@�t3� ��C�#d�>W;e�J_����BP��s�/�sp���&��;��e4��1jMPC������_�K������l��wC����
%J	�&����v("���
�7��1�#,V���[/
R����r|3�����'�_Q|8u	�x����j��v��A�~���@�s8t$.'��<�A�o��;���V6���klo6���=�}b��2���2��<�r:Iq�k��$�O
��"��^-��9�T��K8
��_����C�;���4�	@	Z��w��{�_o�Zu����Rq	��uF�����@����,{��0����������|��g1�q���"\w�RK�$�����X��5�JL[+����Js2k����H5R,�d�;��M1p�#�^S5��W:q���b��@\��}8�k���6�����>�8�0s5m�����b& ��e�s�jEW���ct��:�dg�o��n�������+	�^���������kY�����b&]b��?�!�9�_
�+[���������,7�pHw��H�d�6LQj,BC+���*���Yg�Mvc�w�.����
/*��=�t2.��34����3�.�T��P�n]:Z��
�sG�lB��,��Sp ��l�>F�o�����E������9�|���Tkn*C���)*��p�Y
<��*'cv�>5��,W_�$�2�r]G;��O^����~��|�4�������L�n���:���@>�]�%������AD�	:�	~�>���c��k|.��*���}���VJ>4���(���=�a9���PO^��Ng��������`0�S�L�\RN'�d�Gl��G~�O4"N.�<}u_��'@U��v����v��`�R���z(��OBM!��9�D�H�w�Y,J�j�O~i9V�S���L�_$������*���3eK"�����*�*�|��[��G���|����3��g-���1jZ,��2agP���1s�f�$)m���k>���o�U��g_��h�]f�D��>F^�Cds��&��$3�W������C�%\������5*~�FBu�J��������7��2+:`6�9�c�����:}�e�����Z�1�q7���I��������s����z��0�R���� ���@����� �!����uu@�n���1O�?i���:%��(\�K�Cu��QwU���qwLGW�E����~�6�p���&@g���cV2s1�K0>u�B��fg�.�R�Q���.�����V��~M����M��}�ZY:�F���2���w�h�x�@�����U��F|���C�J�j���������'k�dMl|X��) |�\������������%��
�����N\�u���p�Zz)uD�$�3Q�*�D�2t#P����@�-��8]�}��qe���u�����F��U�M���]���}C�
����S�V>����z���:��W6��N(���@���Iu���Svl�MlB*E�]Q��d�s�*[,npg`�I��	������ WP�d��)�zQR�����Nu
���&���)��*
�}>����k��P@�Jd�G�h�Ute\���v�/5���������Q)f�������*?{����Y�������
���0���cG@�&Mm�uJS�BXN%�s[�LK��:%c�bc�f��r��#4�{j��>������i9(��X��������������ce���`�|[tL�x�j����`�a�1D�o*<�I��bUd's���
��������.�3� R9���=aR���H8d���GY�SoK�"����7�5����0�����.Nt�,>�"g�g�VdA|2���i��������I���M�vT�`l-8�������~��y,4L�X������h?�d�0��EQJo]�LG3��ef���!"8�S����a�h!����c�<�-}���)b�'[p�k1��XZ6���H���bFj@�cB��Z~\8���k��O"5�V*��&��6��G���ea�^di�eO�2���j��T<����U1F;$W�}���M#�e��%f���k��v�W���)����V�n���-i�Ya��+�TL��c��L^�S��2l�d$q������3�?�[Y�H��/���s�8qC����"P�*�����*���2@:��v�����%'-;�#����*�pB�DNrB
���s���d&B~a3�~����q��9?���9�7���_��@�8I,������������5"��$���y&�?������ ��#1�Gb�7�P�,o_r9y�]�d$z
!Z��c,Gy��)��n���9��)�_+��@e�u�
-�\;���UjI��f5u�:�4
���*^��:[�5
�
O�?������a2FZ��r�����s/g8~|�Q�v�o)���$*����[4-��n��*f_z��C5_iS��!� ����5b�H�r�+���_cFn2�t.!���'eo�ZG��
j�J}w�\l��>6K@����J���
Ko��U/p}�hZ�,^����jb�}��gj�h�Y|��l��te�������fv�&p�H��.�q��&�8;>bo�/:3zj��~m�{Ub��9�>��x���3������f��P����:�zl����MU��F���a������-*�b:���.�Pc���w=�����I0�(!#?�����d�/�5�����sL'��H��ncLf��e�����(�\uv��ZrT!�4��.��k�~B���7���z���B�x$�I�&UB����GSW�FuOv�������|_��O8�����RF���{�%�F�B��>c=N���N��.�)�"���}i����
>1���2���+5XB��v�H.;|���t�����h8�T��q��0K��yro��L$���u���-V�9J�B]m��o*a5m�d���\�h�7�B;cwU����	c�Y�5��tY��y&��?���#�U�_:�(�RC��t�R�+��/2���DG��`��TA��� fgF rCI�O���f�.gkD�/��?�5X�5��pO����C3;���>��n���!��_�U���f��Z�KB]���p��`����n�J[�C�����gX$����0��)+7�����U��JH������OA
��[
�k�v(��Mp���vQ-�6����L�r����<q������W^���+pEb���vtN����O-����.�IY/hw�~������v�Uq�s/��-����oA��� HW���c����`����I�0�wq�p��G�]b�}#�r�����z����~���,b��u��z�p7��m1��{zf=.�\,��D�`Y��-�P��,>�!1m��?���C7-��yQ�����m���\9�}-�$���� r�\������&��������T��7���/:�q���jg�yg�xY_v��G�+��0�>�\_�L\
!���0de��������`D�lG�e��Cb�@f����+?%`��T�?�aS�j�k�c��lm*>c��A7]|���*������������U�<S�������q�So�9A�|�W��,�9�5Pi�v�Y���������� %����.kNr����ja�Hen;��]����y��$�����y��7�<.p%Mo>���?I��m��3:E;�����s�C�G�@`X�6&o���n�sR�\��Dp^#aU��<%dL����X���y���O�t���^���^+(eu}���b��>��D�b�����8��h��B�q��zo� �����1z^�.��N����W���(��s�aO�_�(��C�*�J3�Z��?5]9z9���Ss��JL0���v�8�PCB���|����A�&i>2z�V����#Z�D)v�����d��
�@�e	=�B��NP����-L=
�����LO?������������-Ih�)�d���a��Qjb2��x���;�gh�/���3�_�����g�y�Uty��~[�F�P�K=_@2�_��o�1ZZ�E����C��Y�P.Y�����������Z��eV72��e���jZ��;6����L�^�����Bk�������m����������>�A�$Nd��av�.W�Rn��������m|�*E�������2�C�m�.��- �\�x
�1����(����p���9�;�^��u���l�+�]�A6� �h�n��*����T`����@UWy�i5!�]�������z���]VpMR��U'qR�����cd9���+iq�5�������;k�|�XiZ�L�P�F��Ix�$�@I}�?%��j�-�����R�:���(k�f]-�[~/�z�������zo�$q��]#0����N���`�tx�����:V!H[�e�u�9������?�a}_��k#��{�����8��)�1�:_1�R���c����SU��3�G7#�S���1J+7t���
W��:�2itv���Pf���&�~i����@�h�����`Z81�wM��a�����B�|=i�%dB�m���:���q�XX
����:Cm�}��
3l+�k�����G�K�M�7,�1�q���^|V�+����3�$ ��;y���+�~�h:���`��9�qr��'��!@����'S������9����k^��Y
�>�v���f��8�b;>ZLXp�1�z�-�~qyuw���[�����d��Fw���\����\Q���~��ln���Pki����P�1���a���c��J������������m��C�c?���3+Gf"�����[���O�5���$m���]8�(V#���E���m�,�L��
�����W_��)�Vvw/0����s�������=W�K�G� ����u�b�������E���q�r"�����(���K@�~���l]��n�����<��M\�iD��f�hD�R�:�1])�����-r�~�������}��@�Mh_�V���<�O����GZkq���<i��������t��Bz���w'D�C6�T��ht
R����;���iI��=
����S�����F85�:E�F����m�T��KC���aW�CP9�k��ko��k�h�]u^���o����>r���n��%�VM�
���$��R�i�Dy���B�j:����CvF:}�|���%M[B�c��#��$�P
���R#m��A��x����q���i�w�������${�ujqOy�!&��9���u�~p�����N3����WK(������y>�����#H�%T���|c-k��qu�h�]����j=e���}���R��K57�sAqo�����H�����+�6� �P���������{�,{���'�
��k�{��w{���W,��u~D��Od;8��!t1�N]��]�)�QxkTkC�H)�p�n��=��Kbg�N�T�P]j����~D����]��!���7��X
zm�(�j?��dup�M��$�x��������L��\��u�ja�H�|����+���S/3�A�j��L�������Rq������7�,<����)J����}x�.O6��h�8�P���*���2*y�U�7��	l���m�M@���
~�+7��:U������w��t*n-$qq�G+E��M�uLD���N�o}`�G�k)�y��N��:�G;��]��XY�S��A^����������[�)w�����Xev��1���m����F�;�/�~B)AW���-�{
�v�F��v\E�8��4��'�H����W7�������%&��E$��m�K�f��
X)��byg�>N�C�������-�X�z��l�c�l9��<���������'T������+��z���:�Sv~�����+���e~yr����#�+�.O���i��l*�ke�]J���*����bbx�����. ��co����|�K�a�UK���j�ZA>�.������If�8*��tY]����|����`��q��D<��5���f�Myl�����<)�\�v5����w�D5�|��s��t[�QV^��kJ����`h����]�U���r8��UnaN��=%R�����nc�kG�?w[�Z��v����8�o.�!��Lf	]peb�����:�.8�	"s��f�����]�x%���������9�
�{��(������7?�H������ci�����%.p�����U7`��G�;z
���o�&������;�-"�A����6�8r,Qn�/���2S�$�&l_T�������H�6�Ne��e�T�$���'�L�2H��������	
[��]��N�a?k��=6�W�=�\����IXo����(�Q.�,�������EOm�!{��b����B��r9��T�L�_M�>�����0^M��w��c�R��4(�V��#N�����{��(`��O5����.�i�7`xq�b�f8�n0Yx
#Z/���P�����`Z�D�3`w��d�����P���D��bb�����h��fk1�.�1n/��:9>�,Q�Vx������s9�^�&?��r>��t�{�E[��m�"~D?A���e��E����x���f��DI��a�A#���[�C��a���K�\-_�Z�j�������c����b��l����-��_SXo�
@T���i����W���U�^O��.��(���we���_����X�D��f�R�6rR<���;�qt(U�Mus��s�����\V�^3�;����\�d��v�v	7����}Um/�R���������Wfj/��x������T�y�H1���y���o.����������������_��jnV��d�����\����L��|/U�V@��{����dj��-��Lz�f���b������m�\��e&����������AA���.�"N�a1��9J�a��s���/4���w������SF�q����N����������[����+K������hE������#p�����P��8}�R��Z��!�����fV�.��;����Gk�_ ��T|�h�%���.,�v~�FU?_��l�#���,������*	���X�����5�����{�d3K��Z�
��
�����1������+������������KT�.�,�'x�HK�5�gj��Oh���O���;���2��&�M�c�xjh��X�x�)I��������x��F���o���o��*mM�,�"��r���X1����V��>,�
���Rl�}��������������L��*6�������r/:�]��M��j�!���G����u���2�FA=
o��.|�M��0c����"i
]����%
���'|QW��6�.�DT�����Qt��{L+��fD�H*M��n{�w�pW?�������n��/z����xT���|N��_<�*E���U�%��.Pr�(#�Bi��?@�o���]"�a�7���61�B��)��C$J{_�#��P�8�\���8kz����X�)X�������3�����g�A���� ����r���+�|,H�>��L~_I��u�n�/���v���J]#� �|I���\�����b������^.�4z���C�v���~#[R[s���i��y��x�Y�8�����8�*B���F�]�h�[��C�������j���A���N�������%�BM����v�-��>�.�7>����J��?���2��S�ul:\����i�6�u�YIV��py�m��a�0��pw�����1���f���h��,n����I�{�ht����{~W������k�z��it�����-w�S)��u�\' ��������[�hM��P��{=��R,������m�6>w|�2�j�i�yc�t����������jo���,�.K�PY�G�b�J7�]�������^�Nl��-��~r���9����j�����Y�3s�N^-��� OK�>���><E	�}u�E�����}�a�VG�NbgO�Cl�����4�����%��)�A�Z<sC��z;�f���U�d�Z�lQ��I���U8u�!�A\'0x�/���@|����~B�qg�^��t9�@�u��|D��Cd��w�*�h)�QB��2��<��������<W�����_}��/�w������}����cO����`o� �CI���'�#�@��}qa��`��+q�8�iq������
~�^m:m~��`����.{���]�v�gp�����~E��8}��RJYV��(BT�O���sG<�U)v?7��~�XcG����7^Wv9v^1}�[L���OnQ���������>7��Fu���!~���u�}�(�%s��`��M�A<p���@QU��'���R�Q��8�SF��N��mf1��l�u87��pK������JY��n��,xnk	��el������G1���@}�0��Y��R?�N!`��!�����o�7���F��7�������"�w������[��Q����������l��,����t�������~a���_L��9�~���j_��h�Y�t������W���Y��\�AAN�Zb,��AT7����%+�p�Z�{�S������$����y
��8����c������.�Kq[�����B��3�����O5mu
Q��\N�?�����o
*�P�P��:��o�'�:te9����7p�E�R�hW���<�{��/0�/����6yv����w�Pi�ps����
��IYV���j!h�u��:|�#���P���+�m�hQl7H}�,���W�kn��|������3������Q�e��5"}���T	�b�h��m���U{1��t��X��PR��q��UG�����7`[9���hVN.���?ue��^5M�^\[�x��u7�ypsGA�&�=���+��p�a�sD���������6���n?��p��d��~�"���Wh�|]���J������N��������e0c��+���Wo7yA��h�����x\�P�jj���OR�@��b����+s�������r}�����k��:�TD�����:o�GS�s�~k�9��L�U4)?��]��[��	�%����j�f��*A�����O2\1W��?I�)]]mn�R�
�!<}��S��{�$��S��M�YmP�S�O$k���F�j�K����^X�@i.��H�9��HQ��TN���-N�4�i��v��������
e5mbh����7*��X�p���J���`Q����*����/�����7<��,�v4���	�o4���c�iRI��g���{voSW����Tp��b��c��}
��=�������q���2iZ<.�����}�x��C.��E����(s�Ob(6`�^�:g%�Y��|6�!������Z6�9������R�7�h��}a1pY��������X�,��l���(W�y'���*5��z�p�����E	{��9&O�����!���&��v5���z/	��6���n.s�n�'y�|l���=�.�9:�9��9��_\9����~�F������P�s4Cz$v\�8�[���3��#�3$5��U��0v �p�K�x 7�_����a�����������
���$�mw:W;��S��Q��H���z��Q����u�>�����$���Dk�����Z`��i��wyH��_����p������n�~�4+�u3�9V8;nf;�=����!r���~�Y"�^�s�[�s|`������=|��M��������.4Q��R1�*��|��X��3��8��w���%���^��
Lw��!�	oi=�{i� 
s��������S"�������M�Z�������ZvB;�{�?������������]Ad�Y�Q�S���T8���KK�
��Q5���hY������;��.����-�&�q�3�\����(_C���7������(%���2��V3N���+R1L����K���*H5����xk�����Y�����;�_j�����`��P����wFi�[�!����;���F�����UQ2��1Fs�:%����F1���X������G�uS<Sw����~�tOzS>���v[�t/=���w�0��}d6��*�#Pu`��1���K\��Zk�pot�J>����0���n��^M���#��
�a�nk����{����?g���/���0�O����{�&��yfmf����nk�������=8��V�N����Z����o�U��`���p;r���{���,�����d���q}X?a��nur5sU���)X
%�?����7Hpf#���)}�%
��"
T�X��~C��a5������o�rr��y�VuE�[�PD
��p3vhD�Yrn�BS���8���x_=��~�E�I������`����3O��|��Ym����4Sb~HV���DK���������PR"��K����i+���$��Mo���������r���_���j~l&����r^s��g:~��h$W}C+��V������d�\��2�wE��������E^�E��uu���Fw��s'
�^'������i�b�S���t7T��r��%WK��v��F��cSX4�e����~���#[s[���?	��7���������e�1������?�S�����}�WF����m~���;BmLll
D��|S7W�r�������"������h�_nul�|�l���P7_�A�d`P����;�M�q(����������������n�M��
�q��.��R���D�:������n��38��"]m�0�|�J�xjN/�~�!?t������#���H��G@����5���Q(��~�F�oK�#���3���\7����z�]P�W�J�#O�1�V"D{d�����M���������u-mdz���B�^;�j�Y��K��^����*?%�X��
���I���=��)_�����aw��j����(����)
2[���X%�����i�����W�4����;�T[E:2�r��`�B��r�K�t-��$E�U_�0�\����?R�f���n���|�UC��^�\��+w&��55!?9a_r��2{��JRL�����j��������+H5{���\�a'K�"BY����P����^X��
+�)[Dzi(����i�l;���h�r����X��*�E{���e�d6�!_H�"��A��'��QO�mA�5��}D@C�����!�wS�os=$��������d�6����Ps}�B���� _y����[B�>�}��du����`�����EK������1��E��C�}��c�n�������CD�qI���w��D��Q{���%�-��~[�34aF���q~]�H��F��^�v�eYF/]������xj����<$�}QE�����~����rmJ���_����h�Y9-�������E��k�!]T'���"�2m��%+��c/�&{nF��F�6
����2����4�j�����~��	"�����/h��7:��T�Gk�k=��������c����F�-U�6/2$.��m�������5r#\F(�K��c����������f��'Q^���s�;X<��@>��$����e���.����X��hY:`z�i~�lpg�J�y��v�Ps�'���=%����i��X�H���hcN�o�����i�T/!��|�p,d��7���'�P�-�z�������Y�)��N�"�I5����r��S��n����C��r���iM�=��h�(Z����0.lU��mW��c�����T�����>8�j7s�V-���3��e<um���s]?�q��(j�X}$|�+ag\��
�U1w!�G��b2����0�������JO����C<ZJ���e��F:�,�*arA��e���_[��,
��}|�y �q���<��vj%�J�l�*(��H���xiX����kW_6s�������In.�b�\:�����>1��l������B�Hms�)��~N1�y��LT}���>Ah&�0S�C�U��_�(6���7�tM9j�Z��V��	�W=���iU����D5�%��
�,��h�*��	��x��G��'u�lfU�v+M��t���?X���4����B.�t��-%����v\��U}6��+�����q���k���fj�:i�������Q;�E������<�>�C�CU��F`S���\�'�"�E�����D5h��������T�.\q(f�{�02�pE2�����t���10���%��8�#e^>�k&9B�N��l�h�(��)6f`�&�}�!36��A>2\<���2���*Ze>,����w�Jo
��3������n�(��B�R�H�{H>���4�7j%�:��fnP�� �B�]l�1	����P�����!h��'��Jry���[a���������	�<�1w��U�����_���=��i�('�V\�K^�#�o�c�
����cop��M:�lO��4����&#k��yN���/����e��W�4���R��GI�0�������.�U�>�j��&*&�TD9����\F�9�uW�e�V�,I)����SU��0���dk/��_d����k�D5�m���i���,i�>{S�}��b�vf��?H|�����y�=}t3I����wf�U�`����)����I�87��]��U�]C���W�G~��-����J���
�����X	����X������k��%8��p@&�a{3�*E�����4���������9'@��m�n��oV����*T�	;j��[�=�}?^���5�R���b�R(g�vE��O�����c[�"/c����e�v����n�f�rP�JO�&��E1s���\�p<���M?
��o������Z���������Q����YB����-�4���!�@o���n��|��b���JAh`c��eT���Y?{n�Px|�cV�G�|����l���s�F��0����^���Kd6�TL_!�P��l��+��G=�>����
dl	8fx��l�'��{L+�������7�D�H��#El"�t���!F��J��`v��Ik������{��Y�Y���	�U�cl��9s�����,�sD��������ulQ�R���R5<v^x��8W��^���U�R^A��j�7��JI��W�Hg4�`C<JOm���X<f�5��^��)�.X�gO�Au\}Y����0I�\����l0@W^z�����B���P�ys�X����)q�H�H��B�VWUi{?\�Y����z���n�]�0�g3r��%��rsXw�.>����Z����uz��F�ug��'�?I�d�QY�m>��oH,z��c�_��`v�?I���O���mpG$���c�j9������c��(�]|��p�p��]� D�B�q�?���@,�[S��������A,C��k2E�l�+���S��2l�H'�������r��u��
�J��~�m�L�I�0����/9�Ug�'�&yZ������[����st��*�nT�MsF�W�)����� ���K�E������j8*l��c�<������HP�E��<�*��n���� �+`4e+l���V�rWd[.�>�����y�;d����K6�\��mKv�'���w��'S��e"���8�o��|�BtvW:U��=�$���(9�xP�j��b6�-Ce�V#��������TI���
.��>.5�a�1X�i��%�����D���o�wL�Lc^����ix��`?��{FZ�[�k����������hk:<��5�����s���#�;�K��Kr�4C6Y�2����vj�U�L�;�.�l�o��d��{?�Dz���N��\���a.*��8m�X����;�*����f����n�s_W�]�	��[�
l�������F}�2�o2�"��!=�l	$�]���)cR����	�'�������\��#�P�����cwg�r�����T���6�,K~����0���U��&�Gb8�����}�����jl�v�����r��|t��3C����?���?.T�Xb|���K!��P1��X�i��(��X�)��k�r���!�����YL���g�u���z��7P���P�R[*J�Jj�����d����9z�R�7���0��4b6��q���#YJ�'�r�i����a���e�ipn��?��A���Oa��|2�ByS3��������_�5W6;[�R����p9���I��A#Q�~M�7��#@H�N�>/����(6[(���_6(nKWU_�8����i�p�r�^���c�����g��z�P����l����{�Li2$��}��G%��}����]y/a]���;�t�,��Bb�]�?�����N���z^Z�����H���W-��a�M6���^��WD
��x�}g���	�������u���g
QRV�?�I,�%"���-��(c���?�����,�b��x�H���M/��A��@D����L�$KzMv��z�yke+�����<���V���a>�N�4�^����m�,�p[�{���r��w,�����
�
��M��'��*��m�&(�=��T>r��z6��3v���.[����]��b�}-US�
��V����	��z��2#���f�����Q~E [K�<�����N�Y�������������0	r����l��J��@��i������W�{w������Bd�}��>���n+2�N����M��Zl�T���H��G������f�2LC��	\�Cq����5D�:����2���r���M��;�d�j�b�Z�'�S�`T����_�?X�L�"�)S[�	b\��~T��n&N5�|��W��|7��H�����:��p�����>���Uo=+u��J�y�Gb���E����xI�;<��|����~sE��F��?����A��z;S�B�t��T�AJy��9eu�(����	"�l;�K!t�q�����Oi��^b���/�Ia>��w?r��*|l.}8�V�PG��}{L���t���������o�^�������`Y��#@>���}�o��^^����5��^�z�"*�
�Z���#�*�_�~�Q$���z�O_��m����\o�/L����@�7^����������5&������RRF�?�3m\�$��� �TW�V&�H�G�/��5�jl|K��VIX�=�isy4�i�L��y���.�o��M�lU���k>��<US��_��}�}�7u��"��gO������4�K7�!����T����H`�^���*�
�i��������gm�K����^jD,��{t�ZU�E}�n�D�Gc�2�Q����-	U��B��R������e��?K^,4��0���rV>�|�6���z�������Zj
�W�i����e�r�k[�4{CYf;�4�Ea6G��J��R�#.=-��=��eZ�n�R8N��	��w\�x��5�)49�5(MD��A�2Kf�ao���4�
f��Q~��������C�w6���?~�����l��r�)������?�wI^���E��&��1s��i����7��'��pZy�_�����n�%��f�(�Y���r����^������Uxt0��]�"����8W��������(���+�|�X��S���^d�{|����������6�W��-G���W&��$h.�f3`k�U�/"��\8�����:���5]#�w���`]a6��Y�~�.S�����P����Mx�p��`������0��^�-�*9�����?S����\!�1d�����o��uc)�e�r,5o�g)�icPL����A��7����y�P�����i���������4oz_���9�yay�%�gH�P^$��2�F!��������%�O����@��
?�|�T�'X�w�h�(3��1L^�6���n��B�*�2F1�����"����B�0Bk�Y��y~��;����q�rI&��K��1z����X�k	��E���e��9O�^y�15C�����Q�)�{�k�$y��nS�
.J=�[�1��G��:}w��f'���
�Vr~7��k�U����E����N��:�5=`���uUL%��M�#���s������5�fV����/��\�5`4���$�]����q
<A�������qw�M���o���s��?n������T����w����~�C��r7�O�bn�_��8��
���v}��XR����1�7����<�~[��J@�r�48��Fx*�s�s���u~��Y��\�L���1��*x���W����6)��",�
r@��=#�R�<b�>��������Q
v�a,G�1!U�Ha���E$���H��3U%�`��dC-hP nMf��}��������%X�*J���)mW��F����5����2�G.kn������x�=P�x�P2�������U��O��Q���2����C����@M����r�x7�)4g:c���j�^q�����PI�������|�m��D�I��R�z{�ahf	?��r��>� ����r�T�����8������d4�
�P������UA�%j��+3��������������;�7�q
����\)b$�3���y���[�4���gL���f��a�B�y�\��2B��\����X�1Y62���*}��]����FtW7�w6X
�t\�]^$��Z�AM�8��KS����
����#��+�1���)������K��8eP|x)�/��U���rR����
j�����3E����
����c��4��q=���[���`x<�^�6�ss��7�yw!u�V"6V#{�<�,W�4CsF��������N���$�����e)��BF�6i�]�����c\������<�
��0�������^�!k��9�	�$S3�I�����L_�����Q��Zt�J�Y�I��6��Ty��@�a��pu��|�M��0�!k%P�@�	���X��N'-7�I+hH��1��
�9=��6i���BuI�H�*a�qP��dwv�v���-�(6X����[�N@FP��@�j�)���e��,�&�oN�������|��4~�v.�L-j@"�i:������y������8Q�@���5����VM0#p�-B)�`ho*/U���5W�"F6G�)��u�6�{�����;��lJC�����t�4�*����I)+�����QI0��i?���?�`�N��`9����d^;��r��4���Q������cHI����Th���II���@�]	��x�����Z��:bu�c�Q?�?��r��2�g���C�'%���c�*%���r��l��������W��*�;�*@Q����Zye���xi!$��p*z���{���T��QI�ym��WV:&��u�*��(������E�<��;�����1:A�X_m�k9Z)��4��H�{�dmf�r6����e������~�1�D�$c�f.)"���U�l��`��-��].lh�c�Y!�2��>�W���X�k��UV��^�-`��7.V������g"]�W*2��7�2�-X�K�����g*3\q�i*�-c1�[N�+���:�O��kSc)�/hYf�n>a�x
���/����-�����H�*S�Q��n��B�PA���3��G���|��-�U�y%�>�a�%i�.�=�Jo.�.��|�"�h}�/@��Fx��T�~G���V������A���( ��Ok�O��������+LM�������
���0������E����D�9��qh����:�8fT����T��v���Jf�fx+����g\ hP����	_�����������LG����h�B�u���j3|��-���1���V������!"��������&��lun7������^���L�����tL����-��9fR����S"HE&�&9��;mJK��=����#b^}UiI9`	>Y�
T�w�1�!�����-��E��2EEU��{P�$�LE~���g�@�Z�z-�w�G!L�E'f ���'c����n.�Z#n��|-��N@�v�{>���fT�_�s�V*�b�$�K�^%BFh�F>B�Dn�7Q�e�&�{<�����ux�k%�
Gt�W���&��PF�A��dc��$�	��6gN�V�/,mZN���GNEV\����0�����Z"��n���������R��U�X�{�,d����������>�&��W��-2L��0���S#/����|rY^��(����}t),F��*�����uR&�dM���-^L�4�K������=s#�]�K;�����M�����1b8li�:*�1����n*3����iQ"	��"j�����9]�N���-S�!��I�U��k8��{������zqn^*�]F�=_�{c�5������	���,���E����4�]�-��P�7����J	�j�}������<8�``�&������J�Ia0:���WLa1��s�R��3d�W���������h�UJ��u�@I��l�y{_0��T��y��j�y
l�0%(f�W�#����Xc6�MR)T���g�y^����x�����vG���W2�T���q]��+-�V)�7o��n�i���rJ��P�1#;���Q��w����������d|����r�M��j7�T'BC�`$*�6�����D$�Z\D���f1��?]o>Q�,��Z�z|1���Y����aFK9��uT��iWC(U��O3Y��YE!MVS������C�"�S��c����S3�R�,�vB
��H��*�M����:(n6Z_5<�-�':v����?�)�|�^�E v*�5y���)����L�J
����w�b|�Bs��*�v����
�7
�1�Y�yc*h	[
^�;�7�Scb������C�XS����W�b�"��F��SA�cDxo��"m�J~"e=��0v�$
��� <�7@��l�Zt���u9�BldZ���h��7|I��E���?G���l���l����~�������c����._������`��9��Y)��Qq�p^:�k�Srg�h�{j���_��y��lV���sCK�:�[A�b>Aq���qCz���P0��AI�����Mb=W5['|�{�}�T��E�����GHJ�S�x���ze�����B5V|�����)�_3kr�#�n��=r�����[:��^k�e�{�2g[z����];��}>[`y��7�������{"���Zrm��i�3$^�*��7[1��{,�\&�	�O�����E�-3����Ks9��:Xq"Ly���3>�)�T���v	��K0�U��5��;[{�Yt����%�*41��k.,hh�h,-h���0`�!4]v�1pi�?:�;�zX��M�tll�O�yO�.��T?��e�>��x<"j�!;<�������g�����������EK�.��<_5�hM5J�O��oQ����J5/��U[hjT���\J������="�\�����=n�� nx���k���mk>��U��Nn��
�k#s�qu]i��G1I�KMW�k�K��
{��sq�����<�Wh������.�G����ww�O������t������R;��4����<n�6x�oO�����D4��As�]_[�ps����p��u~>�6q?�M=x7(�?�����������k�;��B]�}���!�+�C=�C�S��K����,�;+�GG[7���<���]������U��������SS��1�Q��	x6�yx�k�Oy��nM�m����u�<��fn.�h��g5~]$;��yl������������������J�����g���
S���yp�;=�N~�"���/6�
n�3�k[:�Y����#8\�������P�m*y���:XV����������,��,������W����.[]Ys����#qu��;����$�ma���us]�TM���*'��MK���VN�
�:���p1\k�|[
������t�<Pd�i�������YA�X|t�� �r{�/u��U(�EPo;������"!%0����L�(���������Z�o;����c����(r*�*OS7g��a�I[�����r+�D�!�C�s;W��:i��6U�K�n5�t�iCI����uc.�!mzs�����S�3������1_��jl
]�6��O�(}��U�&���E�0��Qb�����.�����:r��X�������u���}>����oes��V6J�h��h8��@��UjVmv���8�D������8���W���������m9m#B��F�9��>��*�L�������������)j��`�G���bQ�*�����4��d�m|���)�f����W��p�����2��C��8R��k��m����������H�R�IK�~�@tYD��K��������M(����, �9)�
�g��Y�$D��?�
~e�
��a�K�f! ������A��/G�7��������o�o����#�$d�D$NE$�1����5~���6�)���(�I������J����P(���zvDcO&��AN���h�L~.�e.Z��|O�$7�V��m�$��S��06��gc����I6Q+�O��n��B�����C�Lz�b�P�q6ot����eg����#�Ob�����Gz��Ub�����@M�o��>6v�y7���^ur�x\�
�M	�/y��t��n�����|���C���jrZ�{Z�z��K��S��^�C��\�5���0,�4�Y�.��N1��2�u�)}�:��dhyU��%F+���~��e1X5#���l��4����%B���O�������8Pj������r�Ed36�!.�:����D�HLVC�i�y�;�&��~j���Q�H ���WK�����J:�y����&��['_�����#"|az\�cm���(�U�
��uDX*��f�^�gG�Q������C��������0"Gc��Er%C�������jeSj5��
�*��+Y���a��G��"�7���3�+��r�Cj+�z+f������}#"�����)��?k�������HbWG]���%�5��F�FK���������`5"r;���Kqe��RM"���g�Z��p����+6�jx���+����`�!�.�T��N�%d�����3��+O�u���S��P)_��
�x\���+�J-���%y����`�|�����\A��50d�l�e���i��hV�TJ	���5^X��1��Em�==�
��z��CC^7oJO\�#`PE����.�I�p�PW��������V��o���p�������8�V*JY��U��+@l�������0}qx�
h���;��:�����!���D�p���R�����<))L������7����j�i�c�.��+�L]��h����b�`�)��{�b�R�x�d%�x!C����xRs�H���4�iV �����}��X�c��t�y�W��%/���~�D�rzd�|7M��V"+���T3��kLl�g
t������+��0
�|������;M�,�(5��}#���%h(�������J�@�)n�����&�����5�����B;B�������3@�v������I��:o�6��]���F���1��$��LD��mw[�W����������Y�ie�az����R��@{4��r�X(Z	��:�����,*���T�2�'V�}�����;g��S������7g^<��eC�FItS������B�8��>���/='.g<��]Y[K�W^m��������r�D��h5��f�cx"�]s�`~�#=?��Z�&�I��H\�B�������%�g�@:6��g�n}KC�0h�a�]o��H��L�QZ0g��h� G��&��������R�9^1��Nqf}x���g����*�������V���QNh�0�AE��8�_S3h�Q'%�R�4�����
P��*A�>��
�p�
B=�G����h��F$�Wpe��l�|��S����}Y�����Q3�"����f��TH�K���Z�Q��"��,	��h�P�pB�?�����<��/���������4�����G�A����x���;wJ���'�����Oux'���������}3��������l�'��>hQx
R��]���MB}b��5e��
hP[V!��EIhRCE����~#���q~����`�&��Q���O(�6�f���X�����0NS���9��������
3�m��Z���\�������*�Gu�mM��A�%�x�R�^���.O}|�Q%,j���!?����C(u�sC��r.�,���R(a�C\F{5�m[Y��N��,��T9e��K���(�y��f���V�'
|���2�jp��S'C5t�7h�1��U��J��zK�N�5@���h��.�r\��l���f��K��Vf��w��I�-l �IQ�f������������Z,�-�4��/C
Z�}�Y
�=(�
��}}��a���82+m_o�HS����W�� ;�c+�f+��x�b�ke�]#:Y)	��=�AK����C|W�dv7:�m���"������pBy�b�W�Qz�N���}���+ZKFy{~~�h�a���`�}�������z��D}uyA�u����A�8.��G��[���:����������	]�'���L�����g3�����Ii�KV��S�W�de.���i��q�'��m��w�`f��?���0��P���7�3������N|���������.^��>]2�%"�;��Ov�pK�����y��F��:q5�5�d�����_��HA<P�*I��2noKX���d�m�� l����O(q]���
	B[K���$�a.�4V���(�K��[h���{��A�\��I_�+���:�+�X8*�Ci?Zc��[T����y�P�v���lh+lH&�@�_|�V$Li5l%�gb{�9�g_�k�EB��j�9F����������HB��k0���m%[�m5�7��_"Ls�g�B��wl�
������� ����l488��a�Q��0��	�l+�Sd�h����������t��D����
:�^j��G�y=sT��DQ�%�T���T�����0_����D����9�	�����)��\�0Q�U��tKf��,�U��m��,�
�h�qn	���1!K����?�)���1O�O�e�h*Vf[�1^\�Q����Gd�G�����4g�D�AId��\a�Y@=�G�D�!��P�vT��3����p5;e'�q�����L���e�Rp[�8��}�\<�)K��
�6�c#E�#o�����Q���}z�bk�T����IB�����7�	��z|a�5����gy$�s%!_"���zHp����+� ��l�j�(����]�F�FK8��!�StR9��N�lm�>D�/nOG�Ht9�P$�@oo��)1�$G02x��`c+r5��he���A$�:P����������GHa����6�'��dXNKe���r����l+�f�F�G-�0'��������f�WD�0��J�S��� ����M��v@��R~"�
=�����lx[I�!�G�Fsq�A����������E��[���u����s��Du�����>T���\Vc����T$Z=S	��@�����@I?��%�������g���2��[[�V8�L�Sl3�'X���3����
|��[���E������R�j.�.����[	�/l�8A�
E�U��GN��������-dN���b��4X��s&�3���rL����\��m���*�%��x!��T�X��wu\�V���w�H�-�1�	��f��5Q�!� �1g�
mxPp~����z���-�(X�h�^
�M���|�x�c;�������w�\m������@�z��*q��s,���>;���w�7m�:|��N4�]�"*�j�E��"����)_�n��.W���2���(<k�q�bq�$���k��=�T��j�b�M�?hA������c��$�{0���i����<P���y�jA��4
��[|��+���� �z�tQ�|\�C�qq�^��@x�]���0���v<;�m��DV�Zb��U<�t����RY�PFO�����h	Qx�(o�^�$�+������7�"��%=�+�\��e	�t�zst�y���6/��w@��\1���%^^.�7ch]B����w>��wDmn:���`����4-<h�
FkT_g�U2=N|e�Q����
�����{��w�?lm����j�`)���Y��A��}�P�!��=�U�VSI�s������2RD���2�{�J�h�Nt�s�����v���h���T�i�~@���R������	��X�F�Q��co�U~)�*��]���[�p�Xg���� h�~ro)
C�On�����	."B���������o}�i�*��96�
�9bE�*5}1�	���D,��]%zP�e$$����c���2�b�:_���jX�#�&�����CL����MC�������Z���O�a��J3]��Y�d���� �	(��0�
��d�����bW �_��a@�����o4F7�H���� F�T*'�}Q�W� ��W:F?�<V"+�
Ei9���3"��tG�2X�&*:t��3�l����j�4��9*��Z��d��T`��R���rE�����&�/!��Q#�+*w�	e�*f:��,��D:"��4c�U��EJxL�}�Xp�M�,����P��a	���^���H�H��6���Z3l�����u�fGa���B��<�A�,�],V	���p\s�Em�c�4E��fx% ���P�#����p��JqLAKl�a%Jt��{�������c{��E��6.���"��7*�^s>��Q����*`^8?�������F__\������%�Q�����
��z��s���3����$[�Z�2�������V-f��%���l/�P����gyw���\ko�v�;������1C��6<<V\�#g�������yY����*�[ry�����Y���t��q��l�����8F��	(
)�����{��\�ic%~P"�E�>���q��8~]��v1�1�W����=�_O&�mG�B7x�o�x�O �X_�������x��w+iL���+�0V�|��+���RK��������A"���0���L�:I�����T����o�"$���J<������4tC2��p�J�x}�&h^DP�7�n�H��!�XQ�a1,���l28�u�A���[���K��Y��8�����k<���a&����a��=[�$�y}�A=��a��X�����XO�V^I��O�g��3�4C��Zg�L���^�k+�-�}�[�_�V��o��t�o��X�D�Ej}���M��E��]�[V{	������Dl5���Z:(�q���[��)f����k��b�(L�]Zr�8H]H����&��k��D,�����]�f����5��,�g(c(�:�����C�J�����4�3vX�/��'6{��R\
M�_W)D
6H�8��#d"�$�K@��B&������h��G����Br^�������C��Q�mK�Dc;����OGu�����L����x�1�(����S","�Yl���X�Kt�K���@^I��Z3��X��Y�a��<�{v>��f/BE��s1Kq!�1\����o'Kf:��*�m�lkt���@d� �Dr�^�j4�����`��Cr~C]B�i�l$k�<��D�L)O����LG��(����oo��4����F����R�`$N���=�l*������D=G]�hNv7g��i�X�A��T�^�n���n�&P��[>��i9WdUY���rG��2�^��1��$����c���P!��=�
l�F���xC�ohI ����G�����q0�qx�@Oox��w��E���?���?��vPS��Xm~�����ue�[*��$60��j^�u���? ��u��-�;��� �q4jU��:��4B)���L���&���8����������:tTC�f��z���z�v��Y���6�������������z����k=�Yz%��\)�^�*/O�9�aG����IK�w��
t�@O�IM(�$���.yYjU*w2��d��c�9X��P.����Zn���j��SZTll^�J��]�B�"���>a_��:)�9���W��(�����G`q������&��u_nbQ�P�������H���WktY�<�pO����R�������y��O
P����#����W�����u�����5�\�
�i�
��Yj�m���0,R��A��H��T�#(�����������v����H�lH���v�yc=�Q�H�`�3z����������
����1��U��`�|����~wy��V�����f�S���v�b������(1�uQ��b�� �x�=�������c��ki���[�TL���":��4*�����6�V)��H��s������Eu���$�wU��`��-�Qrk���n3��l�N��7�Q���
�,��5�n�6��U' >+�b~�<�E�S�po
3;TM
�s������X$�����h�5T����1������@���u+�+T�-��{z.���'����p�f~�gW���O
�#�d
��W��oQl�sd�I�bN��X�U����+�3 �W������J��e>�>D%)Z���r�O�"jAf�z��4o��Q����'Y2
���a�|�	��<y�+�:{����zT�j�Ek'���SbR7|
�d��`V:&i�{eM�"�����E���]���Y�F�+y�B�6�
M,�5T���k��h*�`���0��S<B��sj�+�����^0�I��D,�������6���������
����'1����T���27�����y���A��	6]��xW���k��� l�}x<N�zs!�aM�fl}��|���M��9<�<j��K���j��P 2jn���B zy72��q>�H�7�
���D.���5�d�<J�u��2d#Tb��ES*��]G�MG�����c$U���{���x�B�'�%b��/9� {���f0���#�������w�0�V.�P#������q��bg�u�l�r	�U����=|����`2QHH�������F�L������R���1�Kv�]�VF�7�I�_
}SI�7V�]��������+�,��Gq�H�{Gp�r���������i��[��4s�r��������
-�0�m��T��������vq�e9?<I�K��w6*F7o�`YH@�kM��6N��^�gZ�_c������b����g��^�}�k����:�=�����h�y'�k��Y)��[`e��c�����g����NAvj�i�*|)��Z[Wi��~��3f^;�8��>��,N���o�6K��^w%��?��@���D��������UJ,���Vs#y{� ;����h����h���n{
FF
�*4�Sj�����:R��f:X?^���K�����c�4��3���BU, |`���	}���U�W�.�/luXGw�T�ze�f3�e��_�:>�{�I���27:{�O�k.���T�� o�*�=z�|@��Y��~�%�j��R���8�'%��1�hG����7(�8o����n_���Ark_b�"�"o��������vt���Pw$y~�,�1�b�~z�C�� Z�H�oM����$F�����yK����^�(*oW��%e��1V�,r}�����j�<���j&\��WgQ�j��U�j���V��FQk�l��&+Z�)bEa~[�T��{���)&�=�I��~��7s�A4�&o��e���pSLg_l��`�Fte�.���z�)�<�
Q�*����'L����]����G�s����
�����_��*���b/U�m~R�'t�s.TV��k(���L�n�@�r���t���N����@b���"���z�t�5����/!!�0���P��'oy�Vd���T�i�j���Z$)�s���)���/�H��{+��\Y�`��5�{��c����������~KPJ�k1,f����>M��&�����U���kJ�qzz������0����`��f��e:1�a�YFq~�r�E'�Y�D��I��f�s��u�-4��H������)��mT��7axk*��Q�+*8v�+��*i�T�V��E-z��b�S���
��uP�7^^}���q�lM��T�_A7�^ye��RNg1��l�������QZI���+�sc��\�>"���[�
+��\j��X��Ps\=�h� }|��Ncu�y��{[�*�`�s�uO���P�v��{�4�������y0��r�p�n~ASc��quuebb�|?��|���F����4������X-	<8jk��>�\6�]Z�,��#lL,Y����V�"�P=�r�98ij6!�/.j��q����=�������h5\v�d���d���<R���E�m/_\�1��}�
����/^d�������x���������������J�G#[+c�x����CB��kdg�c�ml����i�,��		�K#E���/�^���C�H/@�����@�hn����_�|^�x��?��y�o�����e��r�_��D�������u2����?�}����/�--�
�l�m�--�V6��6����
�����y�y���dkii�����-��E�����>�����\�
��*��CY�Qfeiege�oa����s�:��m]]}3}�����������S�V����8�l�E��������E����45�o��9�����i���-��h����F�` �{��O�����k[�lil��S�@ �$9�D�CD?q9^ja	A�U��<�G`�PaE�����������`�O�H���A7�}���,ml��-��A����o}�����>,,LO����t����YX�Y^�320�z1110���g`ef|AH��g��c�
!!�W���������GJ�������_"�?1H����X�Eh�-�h,�� �X�[�Y���?A`e���0����S����)BHI�Ac?8���d@����#$�4/��n�A?ZRbr���(((���������>|������������������)pqq���

������KKK���+++���kkk������[[[���===}xxx2�������r�O�r�����sl���1����`��q��f
���?%�Sd��cT��������*��zw�EI���]���\��#�����;�{�[�����q
5����YwS����/Hp�	ak5�9Z�^7.�.1����[!;A	�(�h��*QR~K�'�����Q��T��1�%�X_�,K��V�vvBcPnV��p�h
��_�7�z�}�������v���GG�F����l���N���w��*�6F9�

?�L:~w^8�3r,\���Ekd�.W�4?+*?r�����RD0�>V��V|c�������0m���W��ip,n�I�� ��&��c�wk��JVWM�G�,���E�es^5j��U�h�z��;qN����H�U/��};p��=D^�+w��7'3���u[Ihl���}���J�hfa:���M84��Ki��AL)�"�A'�iZ/TDO��nBy�*D�U@��i�<t��LPt�G�L������e�(�P�$��&�4���E���b�����S�p\�Q'
��a�Kg0����1m��J���[�<�Ir��ID����YD��\��X<�w�GJ!K*)=UV.&_$����T��D����>�i�_~��Ni�9����]���`�a�=k��-�
��f�T���=��ip��y�%d9�B�����H���3�*'�(�+�H�q��6�W~�����ByJ���;R�#%/�l���Z�d-�y���"n�)������9��=�7b 
�p�w��$^���	���ju����8���!��a~JIE���Q�9��%q��-oi����T�s�4^��n	ye.����qy����3o9���j[q|._�����M8:(�W����Q�+�)�Q�k�a�^�9�s@'����7������:���m�����[<oO���D������j��Xz�R$p�f������OX�^��y�r�c�+���W �������h�b�kO���K��t�:����"���jEP$N^�W����X����:��|����)sX���!Ij��UwOs ������k��Y��^!4�
�r����<]w�FwW�i��k8��+l9h3\iM�D.t T9�76���
RRla���������d{������4��1�DD����_7��-��"}r�����B�N)e�1V�*c����(X��V�~F�\A���Q��|�R�?���q}!�K��9Z�-�W��Q��=��������|��8��i�K�����C���MKuo�^��~�-�����w%�6����E���+����g��z��=��u!�D\�M6+d
������k��-R���O<��=���.K-�!n������"�_{{������@�O
��k��o��lb3�<���JMt�c|
�����;g��Q}d��v�b\��"4^pJ����w-���i"U��� +�k� [��z�A6O�{�9�]�5v[���;{������D�d�T�
�=	��mK�Ou�Ah�A)yi([a>��g^aJ1�x.,�%����o��UW5R�2K���U���pVe� ml
���-V���0S�^D�����aV���	�������� �v�W:.����:��Ryg����������y��5 �D�E����q&_)����Y�Iq����3|��vP��'[�4>Kd����[�n\��Gx�.�U����PC����J�6�M�U����m������#���p���4G��l����#k�������Gz,�p�W~�U�~��$���v��u3%z�7����S>5X�t}�KN�'_�X�<���
uL�R:`u����;H��V���<E�Zc�F�;����8�I�T���u���}W_7Bn��B��d���%&�X����8.�P�*��#LVe���)x�+e��3�������F����������?�cU1�����E����
��Q��Bx@e�U�rK����z���xAph��Y0���E<��\a�a�B�8�xW������5K���$��|�R��I�7h)��%Sh8��F����W�#����xQ��B�!sK&��J��BI�m�` 
q*��c������q�	�d�)�&���q�`�]v&ZL���x�- u\�����sl!�EI]��
��K�v���
���lJ�'oJy�!���&
��|����Z�m���u�{�����n�|uY��9OI`�h�*����1��W�7�7N�V��,��~��7�>�����D�@'0�G�����������6A�Ja`�����a�e+i�P��S*
�'��c1���N���m\D�����6[e���s��s��&_l�����������Ut��X�2.?,'����Q�*��y���g�-#(�+�
�����CYg�������+��_�wxk�L���aQze�&��9�7��W������G\Z���G�{i�AU�M&������A�Y�
6Z.NLY���2�Q�Q%���?�U��8Y/��7��:�y�]�-��AK�v�O`q 54sK�y��R��RS���U�nt�fW�nU�
Vd����q�2��!m�qu�{�@�����.�'����>���)=��58��"��n�) ��������td�a|������V��?�����������I������P
T��x��������pz�y����+���A�����
x���]|)�2�B	b$���D��2��K~����>9�T��v�?��
	j�
�J`.��0f��s}f�[��L!�h�4��+�@L��+�(/��8#�c�k��R���!Y��w��0���!
{�z�I�(Wlw��}64Lv�x������ � 	_��qfq |*OJ��F����a}_�P����#�	9�������	�}�(�nE���!��>����PX���b�Cy�
������e�OV�=��������X�k.�r]�������:&X�\v�;��;2��CgLc��o���vh9��G�
�����:�s7����B�f?Q�4,D��]����{���(l?q;e�y)��
qz#�	����x�	��
���w�FK[e�)��wV�M�Xd��3w����%u.�;0�	�
�Zd^~t�
]-��
!��	~^e�����dXbP7kq�-F���&_�t;-������%'U���wLN��}��W��C|��o!����|���dZ�tg��*�����.����2��:�Q���E���\����%jx��n�l�=M���n���_���)�����5��xjI�v�1�:U�s�C��{o���W8���f'#x�\�I}�,�s&|�DY��X~l��b�n���X���#cN37`��\QB�|4�4<�����l���\C�������q����o��8�$_��+;�*T#���oR���i��/P������i����Q�#�'��j\�!V��/���M����Mo�5��!0f�������b����c�gb��,Q��Mw?�Eq��jV�+�����Ru�X��1��U�x�����M�n<'���,i���y�,��,�&k��VY���`��#|	~.t��3&#v0,�$��o���B�xG�^���;c����6���]�Pk3:��
-�u��_��@*;"���t�x"k�i%�C�������Xj��jg�b%�+��a��)��WG�_#�#�9A���wG�]}�3R��:�)�d��KS��F��F/��?<�r��[� *�V���G6��^f2�1L�n��='�`��*���+�~�8t�<c���p��Pa3'M�
�8��($������0��_G�`uA������Z�4.Ht�9~A��B��c)��K���d.����MIO�N�]��M1��U|��6d�<S4~���4��f�F�k��)��mC�����W��cnMh3�Me[c��,�gi�j<��3�M"z%Y4���97�D�'����'�_~1�L���5"�p��������4�)Z�h
e9\c�����}��������E��������@�
gMG�FZ���v�3�������Y��q���IPb'�DZ{/���M�+�3xX�C�'}ci�������`uzh.��O����Z'����F���a��Yb�5O
0}K*��2���?�F��]N���v���������)&��M`0$N���b�e���HNH:a�3{Xi�g�����F��N���1l��JO���c��1j$-���e�1������l�M�%L>�Vk+r;����
�w	���
mu?��Gj`��*�����[Vwd����#Ip���W����v���8���Wap�d773���47�����R�KV����c���
1����}G%��%��
�6?���>��3v!�����An����
�c���m�������<��X���3�:��&���L��P��3��cbeE�V�&�����b��c_R��W�-��="�'���yh�C*R�3]�Kt�J�=y�y?%Y��^��������s���d<����u������$�7���NVE��W�V��G���6{'��+�Y�����{A��������w�����1����w��L�S���~���@Z�4�}v��D��D�	^E$
*!���s�)?R�wj�-<�H��M���L	)"�-"A2�������2����||w
�7��8�$��1��0�1��#�]X7�I�6�i�[hNe\�o���y��tw��}C����4�J��3d���a�j1��+"��P�aZ����n�q��&�oM��O���o�/��D^�;�V�4#SN�E9������Ub�m��Qg���s�3�v�1 ��D�|	C���Z����Ss�7�e�;���m�=��B�7y+8G����W�$���V��U��v��4%�(���{]�a��
������*���I���CB�BJk.����r8��XGn+�K�7�=�q�L��k��6+�����Obj�k7�no��w�^)K�k�K�����eYor��?S�^@;�:.�2�*x�����5z�(*
��+��VC.JK���x�	�0h���� 1a�Q!���0C���M��GE��J���l�L�5�f3��W�Dw��������yH����Z�<�S��U_����Jh�SN��i��fey�|�daS�*I^p�9��9a��u{�en
�G�5g�R�H(N&�f2��m0�urU�b�sEuG��H�X���9BY%x]�W���av�������<:�\'�+�2]���X^����-���!����'�3{��rl����	7����>?*����P�X�����iP�L%Hg��ZN��,�;����Q_�0/�n��ny����!T�����,�[�4�+�8�p$}]E{�c`��#w����=�9��R_-���;?�5�{g�T
[��
*M���G��6���s��}>�"���2�v�u�J�U��7Q�.���^0����>��q�R��2�Lp�voJ��j)���8\ar�'�s��u��q��F�MKG���
�%	9����������
B	�&��3A|]�������-�<�m�_8����
xz[��O)�l�����C{�:���dq��?�Z;�7�L���1������M�	��m`��������A��=a"���{=��s�^�5wf��6���%������=�h���S��T�<��BD�c+��"H���T���,����_��l�Di��FN����������c��wX�0��^��y�����:W��N*y��q@Y}�z|=���[����)�/, ���O���t���$��x������}U�,�;O��1�J8	y�G��L+��=��q���&����<��[r��v*%���6G7�+�Y��^o����A
�,�3y�0��:Qe����k���V��0�Hh~]�9���/	
��,�h��>������K���c���lhp�">����iA�e�������������s��-+��������VW&������^��E�IGS���z��m/���!
��c����i���S7��rV�J�T��`�H����-�m���s��(f#�Jw��+"�(�u���I��������#��<�����1N.Z��m�&��e����Tv��������j,�d�|��	
�x�B{��	�cp�~������	�q"����,���
�d��j�'�u�m9���������Y�"���������Q������'���"	�����
��C�P�89v��������:��ZAT���!�����i��~+OT7l�I_���,V�h��b+��y��z��/��-��!���&��S�s��m��X��"�����9�>$L�O!�ZX��|�6:,�V���[P�
����{�Q1Y�kA����tw���e�G�N������W2�����&���(�C�1w���(�7B&'&�':�d7�C$$N�
�"J
��~���|V��3������Z�!��`���$�U��D��<e�BAtZ�5���r�*���X_Yw����������(��Z�MtXg�m_O�������"�gXJ�����v���"����Kgx�0�uR��\���J���x�s��H�g7�����M>gSy��5��e���0�t��.-�V(���E
,�<g����z���kC�,-^Y���<��k�����hw�D}i������
n�"%	u�9=Yi����]h%�$�*l���@IFV���sk�M72�s���5~MLf�3��f�~�����F����@B�
T���*Q!�����b��:R;'Ma[�Y�1��@��[%p-O���6
��.����Bu:Y�h5?�I�������f`�9�S�z�$K��)L�JX��ot��s�s���Q�}������qv��:z��<'^7�mP��?�m� �1*���i��g���f��#,���D�I����-!VA:&�m���9�j��3��0H�>'���?�_�o�����8�����d�a`��g�����;�S�0:����@�V���^����RG�����Q��)��iY�����i3��x�X��QK�iyw"._I���Pe�@�c�7q�F3"���wX�g�/�������WJ2��J�X������T�C+�������r���|��]���)�u�}e=����-�]}w>	d�����m�CM�d��e�{7��:|�0�����1��������W��ug&3�����^	���MM&����S����b�#Z]4A=� *y4�1",�:,��?�
c��%���,���]��:"�b���s����Md�>O�=����g_x��JS����E���Gt�L��_�J�*�m��o������@_���%�7'�r����OR�?��W��6y'�p7�A?����� �-I:t(����"y�D��n_.�v}lOd��	H�q�>8e$Z��wh��/8i�� ����o�2���.>���`�m��Y#��|@�o8�:�t>Q�t+R
FJ\G�Ih�:|0���2��|�^~J�h��),��_r�f~�~�TR�@�'��������7���k@�#��O�N��"H�h&S���yA+�u��D���^%��������o�zE�#�GSz��M�S�a�������pA�����-������G�J���M��
��s�[W�o[/;Y�>�z����c3�������P�E�/5jp�K�������������S����1_��[��
�X���I*��pa�M�W�QJ����P#01��h��)G��Z���+�g���/O���{�x�K�-�e���wqq�����N�F�/,�7���r��B�P����!fW�A���
�
�I�s�����X��_�����e��)��-�U�h�*�G��43]u��/l]�����|SR)[�D����\��uN�/W����l��?B��������
�^R6rk|�z��/	��q�G5�R	$� ]�V�:Q�)�+�6\L�9�����Kd��(�����Rz�$���_��X�? ��,w�f-���-�W���<M��]a��]P1��TF����y&|�|!�N�m�w�
#o�#}�oH{.��&C��*��?K������n��������S� E�0��5r�
�R��n�����D��5*)E
Zs�i=i����|"���\� ����h���V�&'7]p�$=#�5�����3c������� �,R+�R�����@��)�����q��
�X?R��7��yv|�u�D�j���?�5=m���sb�/���x�����#�V�%E_���?z��n����z��P�rEj��;�$4f&)�Rbi�i[�����&?���a�[�x+���b>oC�y�S��8�,��������\���X��z���I�O�+��7�W9}T)�������<�'�q�����p��Ba�'�h� �d�#�q���l�!�C�(I�,_����nj���<��	;�Gc�����n�K�"��/���Q�C����x���I\�?%9l�tm�Z6�x���"_�7������I�����@	���1zw�I#'V��B��o�{���*������u�=8��
M1-����?��#�6�[�`)����3#���c��u��x�V��uP�V��|��������g��>C��}}��c�fe�V�������`
?��b<~����('^&j������'Lx��1����>>�RtZ�}����N��%��3w�X��x������yR�V��N��|���aAq�Ac�npo���_�8�����H����5�� ���4�����h}�M��v�#��tLF��[b�x�����@]������r����M_��Vl
�y�v��hJ���Mk�����lmn��.m*#;��������E}�F��3B�����9���H�6=��������jiE�`����$����@kt�f7���@�}��WPA��y���lw�m�������d$�O����1��2�p�f����X7��K��b|��&��������s+������q�������3���cT�o���a8|�~��K�a�����u��z��k����������X�d�WYD�/u^q���fc�kjd�~������4k},��]��������1��G\u(�v|_������[>�:=RP�^����{\��-BR��K<��[E.$c���G��W^�]��M�=������n���|�8�c���-��T�\CQ\���K��zc�E+@�DI�^%��Ol�^��
�K�^m��L�+��$EwFUJ3�I������/��[K���&<\�
�4��z�m��'���#b.��kYhb9��>���u�b���Zt}�p��>���������
tY�g��ok��{���N>$��k>�+�B�����6����o�R��eS�?��������h������59^�|^�q�.��]�~���$Lm0�]���n2��k�����4�U������������e9V��(Q���5�<O*Z��)>�������z����V�������'k�����^-�_Uap�pFZ�.��;�#��4�������q�L�d��$_zQ�����Rw�r?c��J�'�l]����Mz�_FQO��5�N��K1�/�V~:��c��SH�s��������;����`������i�^SU��*>����
������S�1f%2@&4�����k���=J8zG����+�_�B�{�n�o�2�"���2;&j.:��K������O�.���)�=�B�����������_="aQ@�E
���;�w3�=�������
����>b6�U#�mw�a-:����]�{�c%������+�>�K(<��(*��6�@]>c	��Q{�P������w����X*�W���%�b�!��&:�-�	��f��_b��!���+Y\m.�T�W	�p�IM��g5' �"a9H�����PY��wm�E�R�t��g��}�>|���^f.��[���c��->Z��ER#�*���r�:=��c]"#�^I��n4�(q��0Y�I��|sW�}o�KH:F�����?����N�`�������}�Ep�G�����b�c��n)w1{�3nd3���!�Az��� ������7�~O��MG�.&�e=�zV��2S���xh^�����R�m���tt��e����G����F�e-m�)^���\��O�R��h�yX�f�n�/uE��u�Z����5�O��f�Y�Q��t����;#+�"�E��f!}�-��};*�
���T�z�-.�J�����o���f��E�po���j[�	���)&1�������B����EIz�U�uQ�s�\��W��3�>�N�����;�����.�����z���CdD����A}�_��:K#`u�>�T�,4��A'�r;����W��
�����=et+h��QxOU+E_^i5D���2ub	t@y9W:h��.�u8�?^#�<Dr�2��� p��]HE.:�:����M�s�����a�@]0�1��?/����N9;�4�MZu�X�_��nj���X�`��7B0>��b�-�$� �w����4(�e�C���������!�������P�����Z)������HN`����`�a:n�f���"7b{s����'��f,-�a5D��/�����l1�b&������������F����(�����h��77�n�P��$_��u:��
�7��[�*�z��Rv�����I�=�#�/4_�m��Q�C��<�z(��1���c��d�<�7HW�4Y)3��m�i�dJ����B�^h�4o)�)j�������P�m��[�c����-Z�P�em5�����!������"����`v��VU!�)~��5/�/���JP��u)(V���1!~�S��#�����'z����6T�,����,�H��,:y���w#���7��
�KAj��OK������E�
p(3r���������6��@<��]�z{��^�{����5M��e�:�g�������������:�;oo���w��M���%����D�B��>[-���m�����"da�U�$����e�'�8����7{_��LxR�P^�j@#Z�.W�&�iQ�}&�:�4!�0��
��c�V~����J�{����������(s�Z�"d����j0Tp�����7W�~u)�q���)���mSZe�?���-��G]����2]��()�����$��t�����<�z�C"���P���r�N������wB'"	�L?�rT�'Y��)DQ�"�+� x�]�$�>��tXU}��Ng��'���%�[j�S��:�P��]� �YCn��F]���D�B�#+o�R1�����Fqw!��Y����s���'�gjf�
��^*h0��TYf �����r�%{�
)������fq����e�(��-iAYa�{�QO��bp hs����*R������|���K�&lKw��n{����c��v�-"7Q��Zko��=�{�gko}�����U��x�������1y���i�EE�kOm��eK��VM�����
���e�~��n����������W�/F(
o"��'�cI����t 3�P�'Z�����e�DK�3n���M���C�"�� �(�#��� ���I��U*�s���G���E>P=��������,Tcz��b�z���w���d�������\N>>>x�����L���s����Ojl��6�������x����J���������0O�M�K�j|�(���E=)Q}s�c����c4�""���J�SZ��/M#�t��)�QTd�LR�cd�>��^������t�K��.���u�F�
�g�~>m�<�8��Wq�����6����
�S��g��6!�3�=5!��������P�',���?{=�7���5��������/�7r�g'����?;G�7x�l��������Gg�����g�p��-�����`����{DDG����v�@
D��$����h5�������2	Pxu�������
�=��'��]F�&���E���ZF�$�K@n:C��-�~\��hU�[���|�[�4��'����(da�^D|#k(�������WB!�����_j�Z�j)~}{�"3�� i�����;S�f���L�
 ���s+?��6��-��t �Xs�##���e�A
���^��;t��yJ�V^�U���H��m�v�)��d����������������TY������6���a�p�M�������"L|r�x��m�EO��z�}<l�r��bb�����;����2���5B�&��;����b����6���^)���2g��b8��oL�S������#�s��4�7D���M>e�QN��^H�ABS3K�t?�� 
y#�R������X���V��!���fT���d+��v{L~':�x�T������3`:AAq����A��&iL���GLL8��<[��E��?����++=�5��d���=?Q��J�[��}��;�g?����Om��h�=X��&��[�y�B%2� ������d�������0��{%���������5>��W��^�XPve)��!tq�{=,2��]���^u������x;=����~�����V������>t��>��bz=
]i�y��-{��
��@��n��W�}���K�n��zA���������3W����M��q�7t�������g��<�m�w���xz{;x	�x<*?��o6����@�=�;�}N.o��a�������P��n_����+����u�QP��������,�1�z�� �9�<�����������i��^��?�/���y���y]
/7����@����.�k���o��aKR��Lk�}l�W�����
+�.vLygc�m�����*���>�R��4����\B�Yb�E}���}#FA�]�6��p[�ot��qT^\'b�;�0]x{t8�>�|X�ms?��[n��_ub�0�H�nY�z����wMu�]mM<��g������'�����LWU��]�=����+�h<��Pl���l�-�a�����{�������{���z���x�����-|�"jn��^�jKI8��.k+�V<�>D���^��������:�\�y��;lj�����^�����{x��]
��l���.��Ba����[��,�H��W��G+�K�{7������T��-����C�1�qG����Q�Q���f���r���P���G����uZ?����n:�=[������A-�� �����}�]m�g��u9��7��}�7�;�P,��~	������������&����k����������Y���C����:��f�BN��p�H��E����� tw�	z����������[=��x���������������������r����	����=����<v�Rd����x�^�o�L�5�L^��~�}����i�y��}wl����8������H�W��������_�.��=���������4E,�}����m:����{��eb� �>4x��e��M������{}����y����T��{vHZ�^��6k�u_�����=�����?(�����"������eRm�J����|���O7]����������EFg��{��"��������>}J�j}Q��v������������I6J�D"���/�Lx����q�V��N����rX���`?��M�W���:�)^��������n��2�tj7����m���m�wg��$W�#�"�K'��K����a(�w{�|��*=
ju@���V?�=���;�������Bg"�}�Q7������oS�Z�/#���dg�v��}�i���1����7K>�3��������}�O/�@7��&� �Qq����Ar�V}��q��@��`��d#������H�������y���p�
}�[���Co+V����F���������-�+��z7��t�n�m��=�������Q��+}�AJ�[�!��^�c�]4g9���� ��r����~�������uA���+,�"T��9���������7�������3t�l5����>d��w7��c�)�a�"7�=Z��V����v;Yn��,��h�<�=O�EV_�����K)�os�8�>�����i�����~��E�M���4m�[��M��A�L�*��m*���u��|E����zG�/`2�������
"tDi���Y]q��K�p�}�����Htjq���8*�E����;��k�y���QX!t�L]$t�>N�%c�a�@��`\���������<�'xq[�@GK%�yK~3��V�����^�,?�L����J�>!�g@��ID��(���bO�����geb,9�O�8��,s��7P+m��a�nm`���I���X����5�Ft���u��O^%{~X��)���	�w��>��oA�U��z����&o�s����M�Q��������f����3�cM���Cb�����M����x=�zO��m�f���t��#e���eY;mjt/i��9G/q��A,v��.�Y�XzuD�F���pR���Sv [�����r#t1}���~��lw$�K�.|S�;��r]��m��s���%Mo�I���u/!�a�x��}au�h�&�_sU_o$�p�p����mtr������������Mj�^��Cw�����]���"�5w����"��W>�6�~n������=/�#\�WR#@) (��7������J�������d����rd����m�\`�ase=B�$����X�;�1m	���d��N�����#����B����"���U����������a,�6���59�(,���\��+�>�-��]m$A���,���<�;y��e��c�g]�Q�9m���ar���U���\��������f���6��WVH�3��Q�=^�C��l��Mu���"�<"��������OF��&���D�|&�����"�=}w���������A�v��G�����+��
��G#�<����,�_����m1&���������O=|�|oh1��x��8H�u�h��R�X��Z�l�/��@��9���������ua^�z��� mvF�4ij��:,�"��%���*�_XA��vNJF�������X�U�;%/��8�#jo`,������=	��+nn�#J%�s�6S$�``�����i����"�!����x��1g�B��s�����;�<[��l�v����l��	������(���Z�_�P�?�Rs�����:�� 1[�n�d�����Q����hVk��jK�Kl�
�b��S���p����z��|�i���E��"��	�����o�����:�!$��>�6:��!��������(f�N��t���?1����A(O��s�[�m��t�{�~���������rU���������o�C��&C��5��"����UJ�+�+Nf]����}F��N��[��kj@���o��i���+$���l/����(�����v����<&��������QlT�FF����'Px6��w����7���H��aJsYT���B����KSS�6��JNw�7��C2{�:����|0���/?d#�p���pZ�\�4�D�2*�r�������� ��L��,�����#����������d�[r�B}�w����)�1�6y���[{t��Ue�6��
r�?�Haj���d\�����C_R����[��D��R���E���'���n�������6��	kN�t���){��]o�k�M.�c�x��U��$N�a���-3�d����u�o=���j��Z��0�E��,gc���/me�aYb�~�u���m���[@�L��3����/T9
9|��]�f����a�b�'�:g��E6��4�l��M�U��j�s't68vkiD��]�9��������7S��K�Pcy�#�,��Hj��1�P`H���f^���T8������j*�������`��T0Q�x������*	�DV0F�X����?�S��������D��-��n>�!n���'0����mc��i�����K�0�L
1�;�c�+yR��z������L�Z��]]|R�����I�@6A����a&��#��N�P A�<}_	�j���@+�l|e��{���e�I�c��c04�^�]�
+���+�r_����Fp�H�dH[����q��/���D�|O��;�M@�F�o��5�����rv��	�3����:)��l$���6�&8��5������j(�����}�
�m�Ve}��������mO-,�/��a������=��YJG��6h�-����J����9�&�)j������.��J��Ir����R�sd�^�~|���R�Wic5���K�
J����������c�o���������3\b��������pV�u��3a,���
��.��x�(fj�Z���:����p���'���N�ac�nncO��(�*��������3�����k��m�K�wz%����W=f^����
�����R�|y�����"�!����@(�����}��Tqz������?f%m3Y��-s�<>b��Y�(`�f43}��&��`3�'
'����j��s+�*e+���M���SU��2{g��P����|5�xb�!<�z6�`��QT�uq�s��q�z	��
��3\�������M����&<cR|+�������:]���5��p�p�1q-4
�06��������
;��u�i��g��P�g��y��.���
��?�:~C��C�{������Q��l�5�0.1wb�6�{Z���������v�A���W����%z2@����x3�=��>���A]��N?u��1l���E�H��\z�G����H�qy���E;=��8�	q_G��BKI#I�'�~�(C%q��#w���JV�)��+kG���/m�5��K��{b��]*�Q����>]+��������������k\_)2%��c���Q��}1��k+U_�"e]
5������(WB�����6���dU��K����w�������Bd��\I��T�WRZc�/�����<3������y|���"1R�x~|���7Sa4.T���-�
�(&��V�}�Fwn��R_o���C�hP��z�[����m��jF=�RFo=��o������&XH���j��|��L�O�F�#���2�5��}���5���O��������`����HR��swJ&o���������=!/���3pQJ�z�������b����'�R����?����Cx.K�R]�g!J���UR�c��g�M�7�������4�A����/�vn�MU������v�O������'j}H�5�w��M���R[�ka�i���6)}:�uf�k�\�l��w2�'���RI����5���*a�~R9zi�o��,��>�_��
�(h�Z>x
0�a��tL'����/X��xj���C�?�spM�����,�[�o�������K)���e3��.��9A�7�"��I"u+���k�,�e�����G-j�C����Ue���������4�"I�EY;��X��"^�f�B���>��_�HK8JI����'|"��t��<����6L�:��j����V���(��Y���9������]���(�1-�j��n#F�6�o\�IU%FJ�,%���jL����l(���?�p;5;/���?���l$���ce�������
�]�@�p�_�F���g�����v�K�W�H\�a����Y�'��	�?<p
��3�2�����0��)�;���� ��6;���K���i&(�u��?��mFn�Sj�������(��S�m�5���s�4�9''�����wq�3b��n��Z����v��Q�4Nf�4�f���a[�LQx�l��4�Z�7����z���J�Q��W~L�fY� ��7+�Yd��"�����ga|��	3"��|��s��BY���-\�^��f��O��o�&N������Q�Y�jk����V.�������-��[��<2��K����[��7��e������>��6����\��R����� &��/���lC�����i�d�M��5������DEa��w&]����4�
���\m����Mt���?�����D���_�{�pWe���n�7W��E��|%��J����<�z�pEm�ji�XE��U|\$���������^��;��l���������#$7wF3N��*����y�w
�m�������N�k�r;�c/��^��yzl�������I���F���
����������Y���H�]�	u<M��U�����vPt��N��iV�i����?b
Sdf>8Di�8���l���i
�s�?��t�e!l���d�"�|����;�"����x���,3�~�e�D�,M{	�]����o���0�����m�KG�'��+��D�EQwV�A��e���<'�U��v��r���|��\�30��r��AK���q9H������]��7~D)�N7�O����CI#:���t�o��j�V���5�e��D�G�V�k-W�x�T1�Zf�C�r���65�2���@�@���!}L�/]%21,1���1�%����-����)2%
�J�g�z;��C���b6Gb��x
o���4� ��AM*In2
�OC��m[�)r�+	�D����S��^tv�0�~d2��r�h���?����m1��x���t�JM�&�������*��?A�e���M4��q�B�&
H&
�V"���<��	���LK��Y0��
�6�^=9
��s��[2�IK���������A�+]�����R~4���e�����L>�w	�rG����@�����(���k,����:�O�mI����tw�
Q9���
������;mK]H��;_��Z����r��[�1Y��/O�+���O�� ����n��I-8EG���}_��.��r|:q�L�W�W��������ji�'tYvM��}R#������Y���{��0���d�A����u@5�t]��=�B���	E��
�.]�Ej�|�Q�J��A ��T@�����B���$��,�������}��9���
�+%�nL���s.j��H�'	�:gL�V/J4�z�oX��v��iwkVqw��g"�����8�A�E6�������z�pH�@ie�_��8O���5�<\�rsW
�w�"����fpR��
��0J]P?�IL��T�#��L�I<�/��7�}�D�*�*��n����������=���7[�|N�4�5v����-��������=��l�T��!��R���i���vbbo�o�h#��SO�!��K��X���,���$U��r�?z��(?��z����i@�q�wz����th���GGE����v��	���V��v�,�g��:�����s��ni!��=}���,��N�p��g�^��O���f�w��he�H��������Q)��5��^��E��eTp��n.�Q 
�U���U�8��9��7���lX3�s2
��H+~�z���j�n3����_{4���}���q[v�2�T?,,�x�����&�[$gA����7$��J����q>j����@�m�E�����������s�@F��V��&,5�U�0�v�co�v��O��������� ��T��cQ/q�g���L���L�U�W���e�>R���"^�z)���(g�(e�a�QPB��9	�f5<L(��R3�:= �`����ua2�Hle�+�=Z��euW��+D��E���=#��O���<P�Vy���h���]p�q��s�.����_tJrV����#G����`����M��m��b����=���p��e;����r��|�����ajW��fo$����9B��.Z��,�f��F�c���9�w��32����guO�=^���e�L�(��9Of�	6��b����<��
�� ��x��FQ-]����;(x4?< k����l^�j����X������Nw��~��G�
	��^	N�b���*�9�^>����,J�;��/�:�����~|i���
?��w��9����S9�������+j �����(m�A�p����i��H�R���3�t����x��Kp�(���Y��=����ys�% �t��@"�����aDF��M����T�;��
���>ym��\jq��M�;7{�R��(���;Q�}agwDf���n[�
�Um8���6Gz7�}��`�o�<tu��"��L�'���>w�/�-^{���?9#'��hF�m{����hS/J���Sq-�F#�����Q7w�������_��v�,�IM��{�������F����-��W�8+$J���H��vu���B$��#o*��+w�������0>|��^�C���.�����@$_����)E$F����Tx�oDe�j����GvR� �.����JK"���}�@b���#`�28��oE?��:� �v�G�Z����G�7$����p��_�|���$K!~>�Nt�.�jsy���m}V����4��������4�������8A�4��=��K(��+iF]��@ 0���U*�YyZ(����#����A��^��z.J�� `�n����B(1.@J�|��/��p,d5�l�LJ��ad#�WT����V�aw�Is��R/0����h?����W�)�7���F%
�|m�sl�F��S����ZSy'j�S���n�l�����Y�,���r'.&���$����J>/�����S�)�7�\��2�h���
x���F���_em �{��P0b8KUJ�����!�JgS"��
��< �,Q�
���A�V�[�]�.r8��/2�Y�So�&:�y�Iw?���0��Q�-]=��%m?g�.��GD�{fZ�G��.�{F���
�H��������[�g�w� ��c��N�������W���gI~k"MyJY��������O���p�f���^Jw���0]��-�yef��rO%�.5��M��^.�L[���a*W�������rC��#�����+�u���4�I��p�2��v����>U�Q�FB�t������G�c�`�����w��U�'�@8�����|�o����q]�I�v�/]��}u^45��/@�Rx�r�����T���N1�3�)����0�~����g��H�����h>i�!K�N��Sw�������~�v7p<U{�P���4�N�y6�_�z��m8f����-[�R���O<��hI�a����X,����{;C&������ks���.81��i*�7��w�tu
g��+���������@!?Vi2�!-�4_�M,7����J3�P*��X���:�TGkN!O�!U�����K*�&�5���.[/o�V���d�J�������	]M ��5�/s����p8 V�z`�V8;���z,{.89\K\gbJU����L�����p������7r���;S�
���E���d����f��~&�t
��3�����+@�E�{�����H����������1�C�x���&wM�n�G�����S��D��a�$��Zs��}r�D�3��U��:"
}�Mzq�u�KpZ����O�P[)��9�A���j8�q����g���=�I��qM��_l�+�&7�*��*��vPs���)~l&��@^Z)��Rj��:�N�q��D��%sc�|���,S�]{VS5��L�����	�lf�b]g�Y�Gu��w-&�{b�^���m�#6=�)@�@*Cj{HO;+�N��7��3%����V������=YQ������8�Y��G������Dve��z!��^����8�`��������	W�1�Wh�k�<I0�)�?ql��<��r����T����yy����o"�^�k��p�C���k�G!�U�q��C��oh��h��9���*[NK������������
������&`��v���2���Q\��!��Wx�������?�WD�J����������O�J�rb������D��bKW������?�n
3����9��b��(�"�zz[(V��
#hO�K��0S����&)u�`��a�%t�j}T���Ed�%%�[[��\a�0��,����5A�_{5��7F�%�yM����H1�#���G�ri�;�e������X�/�~�<
d�-�j��`��0�7�]�9I<o��M�h��`_#����E�?����}gZEYM����M�)	F���O&������p�Y�:X�bQk�w��B�`�r�����Q�a^�w��b��h����f3���@
f�v���2�$\�~�:��F������2� �Q�������	��h��$�t���6�x�5�������u�h"���G���+z�O;��s��3�xY�u�I_�ZZ��/�|�����F�6JG���E�V�6���K��������\�`��f�������m����#&��]���D������^��BB���xm��\8'j6��Dw�D�v7�j&��'��P��������pT����c�A��h�SnI��l��p`��4}^��v����k��i��w���_��j�RK�\�-�>68'�"�t���������=�T���j��<�M�O_�~��l]VW#�������]�c5�#e���}��)��,t�H����2�8��_���f���>��}��`+��u��.�_i
��3���?�E���>Q������������BSe���]�p���-��d��Sp]F�R�\+���e|�?��x���M������L���������%��zx��R����Q�\������+IW���:�3Iw�G}&�\��d!n�ju������7�^����������/��R���������������r�U[`����V �=�W���/x��mp^����Y*��D�
�'�
%x��	M�o*�)��&�t���x�S�d,\��"�"],a�S�'0�8ZyuV9u�rv\�g�x[[Z�4�H�&Bj�����=4�kB�
��s!<r������MDX�h�w6��*���{�����%T��i�y���#�^g���w�}_���h'�����Lv�<���3Z��sM����
���fDM�&��9WhQ�:	Y��\ )*���4!�"���69��A�U����d���W���a1�`��4��eV�O���3�Y��]"�`�����IT|�U��E��H����~gV���9�����j ��e%kTcao��"������i��>4L���s ��F������\��
���
��G��3jk��u��^&�����k�R3�q'C]�c�����!]m��D���(|��g
MM��`p"����:y��>Z%�^F��h"�X�x�`��Ln���Lk�i@S��E�{��n$P���3������|$��:�m}S��W���b,b?�-�1��D��8�o�!���O��=M�s'&�����>�<.�(4��^M���2���U�#QT�XlT-�n��+��]��6.q6R�--�h��b��$8�����x��X�w[���Qi]9!ZJ�N���Z�b�/�f��������7x7������������g��_S(�������K�S(�Xx������-���7��wfM�m\o�� �.�.j����U��(���"���B��8��2��U��?�0���\�t!p��d�������;h&"��5��/]�$�1�W����%����������D��`�E9E�"6<�m����B��[ c�%ZA�]j����&M��R(��J�=jr�5��Af�#��Sn$2C���:�/w(�{���J�Y���k�;O'h�q���l�|3���#1)P;Q�!�gV��W�4+[a�i�a*[��@�]Mn
�<��I_/=}��sN*Zcu&�X���;�1,��r6nD/�k{R�iMP2Js�I$V�l�:39S�d�����:6@���B���QI���TW�1U��\/���4���+DE�NC*L����C;�K��	O����;n������V4�6�	0��X�d>=�.��+���r 9����W�����n������R�%����������w��O_����]��TN��$��q����8J��7q��;��exY�������[��K���L�?)~
�{��u�KR�����^��n��}����F��64�����#t���r��%��e=���_<Se�;gO3���5�|��y����}y���tX)�V7�HtIL>4��G�G���^$g�r��*</��X�'����xck��HP�������
�5,��8���r�v�����';�ND����~���]���-��u��zR��I�+v'N�5@e�X\���kE��g%�P�`XZ��)�����s3���q����C�b��.l��H�����Vy�A��j�p�S��3����
�vW��Tk���m���S��j��R%%����e��PG�<�J�����1���\bw�k����Am�Lh{<&�M;SJ�l4�#�s�7������V^p�i�o`�k��7��95a����|�l}�W^-+���R�����1��:��Cw`,S�vg}��N^C�����A��y���V�7	�bdjy����.��jd`L�f���aD����tok�����6��5�)�����)�TsW�sb_K6�<(�#��{l����C��(��'	R8e3@_���]�H	.���sZ+�<��P�Wvoi("U�!U�a
g������P\�\2sD��@��?�y����Tn��D1�5��#�E���	���� ��h+]�Km�����A�����}�b�	���~�f�����W�P���93 [R!_%Gb��S��4x78��k���2,��.�y�y�BH�U�p���D�9:�\������3��$�������n���~F}�B�=����C��5����"��@��E0V+����!� �"�`��(���������f�XY����������A�<�7�}��R���bP����������%�b9]�_�uB����H(� ���0jMK��z-�Zo�~>�M��B!n~G���}
���r�rs��0zv����a=�Z��,I��Th��P���NTil9���0�X{/�����y������cX%��]r�m�����`�s��!�����&��-C?O{�2���
j�'x�N��VZ"�����W�o����dJ���v+���������Iy��+Es�����|��,��\iz!��O�=�4�r��r-��24�e�d�S�fn+#��v����}��o���+���H�w����y�U$B�?��W��W�*��FF���u��n��{$vV���{C��@E��Qz�KV��7�,]��MO���M�6��c�?|=��IZTK?YejI)���lW��t�3������Y��6?�������[Y���%�������3�a,����)��8���H�8�&�"�In��|�)�{��r���&O[�)56cu"��^Y�V��R��b�=�"�)0j�(��W��X4:=C4��8�G��
��1��ya����ae)�p��fo���f[�����fa���/h����5,y���&����~�/��VX�b��������jd�����C�����*w���M���/�\�l)0b����h����F��'y:����qyaL���?�����28]J6���������{Q�eK��7��}����i�G���dS�F#?"�4~��z����1���*�	C���y�A�
xR��{��+���������	;��w�$�L���.;T����������%�X���4LX�������b�U"���W��>+�{�P�F�j��Q��B|�4���	�=4*C�Kd#zk������;�2?Y&u����X���s��C�"�8���!,�C��`���r|F�
�2�KQ�:I�,�����m�5�����������R��O/��J3e����!���M���e� ���C
5�@����>b���P��K��T�������%�����k
�Y#����g{�VG
������N�Uc8��\	�%������2J�vZ�w*��Z,g�{�W6_���P3���8D4w�����^�#�B����=�
2���|U�D'�"e������������������f���1���C��_�z9�f+���u�X�y���C��������A��+zma��������d�m�0���\0��m;�"�Y���������X��,�I����ie��L@�_��� p� �Fu����Zg���[��z��C�o�
��Qm������]����b��:����
q�cH���d��
V�x'+yH�B)�6��Up��t����������J?\�W&X��������\����P�4j�\��W��������'��)R�Lm.��LL��������|����<�eLc�d�n�p|�y��������Xv����)����M)�{������1��7Cnf�=r'�FdR@On�k�����T������Y�G�G�"������Zb=�����:5�7���Ke�Ac�=��`����
����q�E}6���@<y����������)>�^��"4������wKA�����J?h��!����
x���7�����������^�Q��s,(��1h�8����j�EL��g4.��R�A��e��D��+�~������<�����y���H��z�����f���A��>G���!�!A5�[O���M�s��(Yt�A�L���_�p�}�����3����[�\p����'>��E�I�����7c�?����_�w�����5�j��D�(�Vv5iam~I��
��i�'o����gA�j���M:��u��/�Q��X&W|�A������92p�j_.�������o��M���"0����������l�^�n0!7������.<�m-��O>vC1]P1�p_1O�k�b�Es�BsJE�M%~��$��5��Wh.)���O��:�N
W2��2�<�3[�(������~Z����)#�R��i����b����$������[�<c%o%���b�������Q��T�� ��8��p����eZ�{���w/��	�s�������U�X��X�
rZ�)	:f�b�b���i�����������^�1*-��H�*����Vd#�>�cdt_�9������|��m���
pcH{R�YE�M?!E���@���#K4�2���6BY�V���	��J��qV�X�j���A���EfW�D���k%�NNA���7?�&~������I��zh8����6�����4
�N��Hw�J�s"��;��B��!!������b�]2�F����$O�2���,����|�E� E4��au�~�����;YY���y��������/�"P�����p���7ejl~������G\��p(oi����-�Id"�[.d�3�+�e���_�r"z�M��`y,�&�d��wi����c����?�e��9���������C�~0�9���S�8��Vy8��rMg��h+Z���o�[;��nm�������wq������*d~�i;��C>������|�TnGl���F�����;�����3� ���qd|�2�k0�����M
#�*�����+<5
v���u2(-�Pq3� ��������F;w�X�+J�G��u�q��p����Y{���_�v*R�]�s��W�f��Q�����]�z� ����m�y���I8�p��8�T���c����6�Fi=�Z�Yp���q	,��]	=_���[�'�$7�����;�\H M�c_#�X"2?3��yEL��4�=��H��|�K�Y\ �lP<�t��+�<�b?��i����V3zv��b�����r��
1��-�eb.�yDqH���[�0,6@H�{cy�8��<rk�6N��3�v&2����D���������)���vp���X��3�����91�	^�L�InK���D_�y<�I��N}��{H�b��d�Q^z)$|�~�c���{H'#����0��=NY�*���@W����%���5����J����~�F����� ��~K)��4�c�wb_[����PvQs�����]���������~G�������s�$^'��^���5������x�K?��.������9Qw��j�L>���x��K��z)k�� oN� +��#hZ�^������������}� �=z$�?�����~"����;��~R�B���'�����iz���i��8���)!b��t�E�R>j������Z��j�G��z�V �=`]�wN~���LXQ�A���"�B�����0m������iq}���nJ���5[����Nf��)��m@S$�H[�r���2[~��v�%8���N�������NgZH�a%
�SyM��{����4��N�c9!/3,�H7�H���]=���������4���"�7�.������������'�2��3��e�-U~���OSI��U8���:Icqe�n�~�YB�.Y��B1�6�6��bi�S���[�C�Y��Y��;�u�_�4J��<�����G�@�DN����PH8�<�������B|#���$������#�s3�	d�a�r:g"�:\�KBvb��$�-+����E�+q
8o�
J8�����q�)������E��l���&�Y�#�
3*y��\��>��
X������	�F�"�^����Av)j9����BB��V)��7�&���*I�����bLR�V�w��*���R:���AZ����B� ��.�f
��	�EK���W�������_�5�4LR8���������������7���%A��
SDpI��tz�d���Vr������s����x��ZK��}Kx���G�����v�0�
'�����J %�-��6�^J�tZ`~��o���j�f���3��s�/mgL��G;�lS�|��5�Y�G����?���2rS%u]�1�����\S�l������e*������X�;
�����4����[#-&8���K�N�&YH��Qr�{>���t���T������?�S�J��1x������g�N�;��V[��I���
���=�.z^�t:��E��q'6/�.a;I���S�4)��S,��a�������m����lT"�H��_���	������r��@�cb&XE1��!�����)v�)���������G!��|&l\��4_Og��*��::l�#�k���@�VL���=]N����e�U�M��#����E��)*x������ �63�-e���y���h>�P�����\T�k�G�b��E�K��T�`���!l1g��YM?,X�c4��OZ�[K�k(�}��o^,ObG(�o�Xg�	H;Q��$����Y���%�D�P�Z�L-f6�g_&�+��W���7����l�x��� ��������[�i
�Qc���|��T���B��i���9i�"����=�����.
���"�P�?Yn?J+L�R���%@��kR����xIBr9R��t�^��%Pu��);�E5����	���Ruw3�gz����e$�g�p#	�
�gn%m���PS���C����T�mL?����4l���P���T�>���������'c�
'&��/E8b��S��	XO����2��K&}�����;�����q�D����Z�Z�VYK�j~d����o$m��#�<�fp�rQ�^tQ��r�$���m�����u{68��<��*��8���1�GJ���Q1k����{FW|U���4�������3�2	�g1-5_To�H�eX���J��v�p�'rd�A^\%�E/C�	�����r0hWOs�I����w�4�h�%���4J���T�-^$����S��6�)60j��trOM3��f���N�$�	�����������.P���CV>��wDwS0A��,D������a�4���v�<{L{�8��{�9]��, ��#���}<j�+�h&I5JD3�2��5iR$�`2�M��j����s��|w�|V��ul6�<�����"�x�M�2���Q��V�(W��-�2o$����������PgS�]#��N1?�{66Fv�IL�L+<�������4�v4�$7a|f�(�Y�2��%���{e��i��L.��� 1Mu���sI�eJ�`��U%���#Z���|���{CJ���Msy���|�I����z�^%Wm�>�.W�]���ak��S�j�	T�R���!�����F�����$BS��a!�����mS����sb�0
+!���W<;�}C��3�~�Q_�\��
�A��L�h	3-"���[�������x�)u�+T��CR8w16�Vp������S�uG9����v�<t��D��R�'x�*&4��s�B�3�L�	Fh)�R���������,��F���UUM��^Do^��
���x�����#c0*��v�6�����]l�rc��p������4��:��B8]��K����0�C>y��Q��X��7��CP��H��Y��M�E��H?���yd6���U�lU��o7�
�!\$�������b��bJ����W���r_P>�uC2N�����sZ>���A(�����b&�m��ob�:��]$@!���
�
�,��s�D���yv��=�6u�m��\��z��'��J�-�����K���2?D&^<��MD4�!�K���-�a��*��S�>���������6[��Ys[��HjO������ kQ��r�k`W����������&��qI��eE�����"�(J\,�2"�[s���*Or�k���A�O�'����3�<!�I`*H/�5-7����u'�8��1�*IJ6B	�-�9�����R�"�������^Ho>;�m�����qa�@J='l�� J$'��K��k�d�������sQ<b�l3P����*��������H��NcY)�IZL���y$���+�/iEuS]1�#�i�P�a=�Gl�F�5�Yz��m���/��<�f����'3�����z �9Gn�LZ�(���(�������T�T-�$�������0`'�"�e��n3�}o���.4=�^zJ8�����u�.h��f�t�q�������J&�~��������H�P��R-���R������>�>�mr�~�+p�k�`2j�=����b�����fsX���H^�i��P���K��~�j�EDt�=�ai���>L�-��q���eQ>fx��mf=���w���u����#�D/wC/��tIN=��c���W��*�eW�8{���~Y�����|}~@VVPt���s�������)���$lB���������%��l�a+�.#��wL�Yw�w�}e������O�kM�?�8��V ���z��<=[(>�}k����"b�l�x���'t��u��[M�Mx����Wya$^2 #��t���L������y�	�D:��3�������i5��)�	���1�l���>[O"�2�{�q�[[gSK(��^W�b�x�9t��a
e-�yT���WW���Q�"7��x�{)%����&\1�q�J���"WJ0��!�W@%����6���7��g�@M�^���bi�ON������81.�u�P��C�Nt�k���m�_��������q�p�h�B�2������)�-�(���2�<C�=�!Q�`���o^zd���?�L�ys���X����o2�s,�z�~����+dI�,��B8W�:��������Ai�)s����8��zy[����#�h�)��|��^�b�g��y_��p6����n/"j��� ��0:R6��X������=d
N��q3y�Y�H�w��a�����j{���4�!a�05�0#e��^��L;��V;��s������i�[��-�(��d��������_��e�5�)`P`�3U<]~<.a+J�]/���_���6�	��h��_8��q����_���oO�M�	D�����f�_�pp�x������z��R�����d���V'����r
F���h��p����=���Q6B�]����m�L��=�}_\��*�]�E`�{Q�����7��q��R�v���:dA9�
�l��u�� ��,�dw���"`�na����B/z�/��"��i��s�A*��14������[ck<�q[
�M�N�e��F�����Y���X������w6Yc�J��Yv�CZ�j�4����!���O���xnn@�H��*9+5
�m����9���O�S�8mF���?u� �?��DS�*�	����k�Nd�6���Ez?]W�0��T�~�������"�!<}��c�D�w�Z��(��*�"�y�+�f����<8 ����'���IZ��EBp��1�YZI�*VT7�S| ���]��LNG�\f��@.�3z2�����R(����n���wq�����8�p��$���������@h���j��{���w`��.����2�H�����$	��*��C(�i�Hb����i^��kO7P�:F:�U=,|2��c��9�SY�Z�����KW_j��0���bN�8�s �]
{�h�r#����|HA�\��'���5���kQc�I�%���F�%�*�t���p�`�.�����DR���TO�)g����������GE�����<i�����W��>�=��v�H���5��<�]|V�v������lV��-���M�����"2e�l��9���0?T7�X����X�2��P�Pn�jX:_�����S����9�g�n�}��_��P���F[J�+�����F���,�A�E�X��zM��C�A�n����C�}��s�n0�
j����f�Y���#�H2�~����
}��B���n�x��|���@K�����{�"Y��K�k"����7N��T:/��(�8�h]���!�\�w��6&s%�c��<r�v��4���a/�uI&�MCt�������� ����������@2�'F���8�U�������������m�����d�|l��������U�����>���?]�N����wr�������[as�F�u�V�	5�����U��p<���MO+g����+W��Rn���N�����PBv�,�#]����0�������M���6��q��?�����
�����.6������r���e?�RH�G����� +>���]��������KG��TEe����c"u���z�����N����:��?�uX���� IZ��
�;�N�4�*S�A��g/�S�� ���0[�i5T�h�&�� ��1�5W�L1<i�_��3X�a&<��W��n����m5���e��sQE�@�.�K�_LY�1+,��O
�>
~����9�p��16��%�g#\�G�m#��n���P��!�6�
��D�%r�@Q��	��u���VO�%�����j�I�w�P����q���dm��Z��V�_���c_nR�����
���Z�n����t~
���U�3We���D�^K^v�u�?vEZ,�u{����nx�G������#������w� �u��8��������*t�10�P�U��k3��������bD](<�bI^��~gP,����q��1�+n	�2��g���_Q��2�����(�Ygv��U��r��C(��-���H�R����;A�I�;c_�3���fr���5��QcZB���2�|��\��V����W]��zL�h��QS�t���<�w1�����%�Z���`�+�47oP[�b��8�� O���i����L���������$�s��	�{��[H��1����j�xb���t�2����T��g��^*����G�r�]f���7vF���uFR�Fjn�G$�	j��<�c�8+qd (|�a>�2	������{6��[U+���g]!���g�����J����A�@8�m6+����������u�
�0N4`���j��p(�9�x{c�|�� �������r�w[��dQ\�;3+g�Q}@��l�����?�c����0��&�G�D7��_ 2�~�����=�|�4�@�����`��U��wrb���(�h
�"�����3���������#��Vg��Uf	�R
'�@�c�����q�R5����p������L=�NC��� K
{e�
l'.����r�:1�C}��_S�����CT-qa�o^��9S�b#�[f9��tu���������k��	$��N�2N	p�#H�j���>n%{�-�b�������'Q�/����Aq��s%��}s�*jg_���r��#W6��H�j]�D���[�#�xM���;�����P<S�dZ`��%@TU��|@�����Y��t'��K����:	���L���h"�Cj9j�Q��1�hV�r4	�4
��KHL#�����!����t�w��������-��C>�7�Q1�/p�g�q	�_�:��%B�J����u�^t5���L�{�?	0�R+�ZQ'3�����&�"bX���8S#@ �`4��K�;"RMy�P.�FII�B���ip�a��N���|�^T�b�;[�V�"������.������h$������z�.������S9l��Z[��Qj�<P!4���Y�w���l����T���VK��8�����_o�8O�����o)�5v������O`�]�R��]����&#�A	O��?[D��s���T8�9�R�����(��f��+��XH���e���P��6��a�G>��k\��R��W6J������2�~��v����^����?�����U3�s�v~����\5��yD������F��"	����gk�L�}p&������p��q���c�-�pG$$cZ��rs�����ke�jRRD�9UA�f#p�9T�m7]S�G]/����0���a�����K#QYJ��REJ���$<���G����{�7t��7_^���������>L#7����?�*�?����������_(�T0��{�,YfJ����Q���l��p����II��H�����������JW�����E%��.��yQ�3���-��������� �mt~��!,���]�D7�����k��u�D|d����%��<9����x	y��Ys�#�>��	NAb���B��������_��L*>��	�r�D .%���f�=��T��a�N�n�
�z��Sg��?�:e����1��S����`����b�&y����X'*Q�@e��K�������-�/�7j���W�{L�E������^p��|��cy�y#u��T_��$sgs+���J�#]��'�@m���2�s�lRPx-`	���jv��
	���W=�u�=J#vS���}<��t�kN�x��#|S*m����Y�I�j��lWG��b>��)�������aiH�����U�,B<K?v5r�{�8��:�h����Dmr6��]�����=�%���W����5L�����K��*\t������g�D�5�
��������cA��b~ W\TzS4&��O\�#M�����M�Q�����U&�&�O����*6��4�E��A�����m{@G������"5�i���J:2��I���ZU+�m��V�y:�,�|pv9�������;����:v��g53@
���g`q��X���W�XOc��I�Y������Kzn���{U�`v����+����i�>yr��}W�(R��@����xJ��q�[C�,����a��%�����qj��k ���^��	�i��rJ�0�������`��Hz$��)A.��O��)�?�a��1Z�A4Tq��A����IlB"��.���>�Xq:~U9�t��x'����;��M�����`eC���_�3�s�U��'?kd,i.z�����%,�p��\]+S}c&^��:G|�,om}�Q����5�k����6p'6A���������8�C���K4����5�PX<K��D�Oi�Rk�*	f��+e.��w(���$��'+����9���
%���g�A���n�q�g^����!�~kD�R�O Vv�%X�z���#���#c������,�l�J����y�>� ���p�fKfx��8iI#
����k����3`Su1�d�,T��a"de�g�����]Q�=7�8���5��*�vR�yy.���uF#��qu������r�j�G��@���>1��+8�N+����������$����������s�?��[V�����T��R��6�����E%��`8.Rf���;5}�8��������
��U�V���R���|�*��L���������=�l�^g	YZ���H�}�t�+R�Y:O����W"nL�^�5b^�K����z�"�j~������*��u�
���E�3<(���d�'����+�=��lD���
`v���,�S�����^���n���a��[t����!}.Y*��3�:����*�B_M���(~1$")����o*�}o]��]��1Y����0��kl��������k��K���q�?u�����	>��� ������d
�������S���UH���vD��!\,�*[��x��|v��e#��_i����n����P}X��=��'���I�|k���:Jh��[�eS����5�+��A>+���\#����4��>��3���K!���6�m7�*��R���5c
��>�X�,N�{b#P�F�X��.d� ��q�n����RR��P��fVG��UM*�@�������n7
Y��
�����R�R��������:�{���8\����}m����c�� ��N�Z����9c�8�
@9�����
��\���u�j6~�I;<������7��*��)��_��y�}1��B��gEg�S��I+�T�0�$��irBX.?nl	���d���N��N�R��V	'pL��c�u��E�5,I���9��^���M������O	�tOeO�MSc�RYU@B�����=�t��$�]/��������j��l>�e����D�������6�c@S�i��9V�u<��,F$>��H�C�;a���+�g�������"m�=�^���/���vp�z��X�p��,`H_�����_g���H��an\���������@C�T�&6)���y�������a�]5���V��tf1�<� 91��V"c�+K�pYj�+�{���J��7��(��7�?E�Z��&Z���h#7�>�^/<�f[�e�l=K ��N������:L����	�.z��nd�c�@�hz)n-MRd"x_��0����<��w� uE���R���������?���-��wB�b����T���t����)V�LZ?�H,=�C��c����a��t��{���jI~c�eOT���RMLC���W�'.���<%��j���VD��n./<��X1��po���g�6d���/���"�@MQ���?u�]m�B?�Y��A��Ra/n!�����6j�0�<&A��/eVW�P��jI0B�T��@�gG�3�4#�i4���k`���Q5'��6� �FEx�'-�k��=��T�d�gd�������&�/�Ml~��/$1�|;k�t=����o���C��yL���	[_��1�,������LN���V��g(2�c��MD�L+�4��m�)t�K��bA�K#l�����I�EB��
g;�������o����4�m������,=��[�d��0�o��OL����!v|�\���f��4'�z�|��Z/A[��^����3�jE��	4��o��kh��X��-�z�G�1o]b�BL^���s���/�z�)�R^�c��Ns���<q�5��7��d���0	�x`��o�i�0���_l���V�
�s��0iH�x=�z���7���
gk2C�V�%n�R�m���D�l�(I8@]�U\��vd�U��ykcI���P�Q�9t��������b�
p;��)�V7E���b�3�R�In+�c���A����g��5�f�\C$*�����>4�������l��!�X�X��y#]���i���J����m�}�Ir5'���(PY�������@N,�[)��%mjaI���\�B�E�:>��{�98���f���f/�k��z����g�[��������}���d�a��Y���zD��h#�Qer�%K�V�yP�`��V�C	��;T� �����<d��
_kO��>�8�!�'���w���v�5Ry�������Q�|�w�G�4�������9:��}�V'IKc5�3��Iq��t_���j8�AwN�C�{��4>���h���h�uq��$T_�p���u
�����}��IYB�#�*�Wk�,Y�D
�d�5
=u:��K��Zy�_��U=Il��e�8�Y�>���u&�4�t��*������Kl���ue���+[
� �Rl	�ep/�
x\
H8��f���O"jVn��(���H�R�?��1DGWd����^����.&���D[������m������� �hc��>�BH`8�V�k�p;�GLJ��VV��>�����e��P�l��*:��9-`���l�j�Z8� �����?���N�D��g���Fc7mg0�D7�����Qp��E���v����.s�L$�`S�HP��`�!zBF��(�P/�B�R��C��(i�a�����J�D���%V/�[���h�-r�
�R����r����l��B#�D��B�/>Y1M���5������1-�RI�3)����Mhj��������"z����S_U�z?�'g�H��#��l��A��qN���� Qc8�a�1<3k��.d:S����J��J��b��o��\�C�%e�����j���E�����*iD�W���a4ioW�8�Us�-���`�vx=jnVw����
nwo�+�+s������,�j"��zt�d�)��e��-p���\F�#��
������b��F�[�C�!Gk@�DD�
'��f�)���y6n%F��'��72/�)�s�$�Xo�DpUEWE�q���Q6'���Z�EY}�<A5��	E�+�h��+4=�qm��:��>�~�E�e�`�uH@^%/w$�n^C@m�r��yP'���g
��s�� 5�Y#��k(�����4��o����F������M2��MR��$��J�����G��o�[=B�-��p�����T��;Y&F��p��Yk#�94�N�zi$OH�I��v��[�!�����6�O�P��F<w2��PE�+�?z<P�����Msu_@Q����@By������	�p�-��k'U���������<���ruV�'����������aP7�yw����L�5���Le��(�h@������FL�/��YH�\6�:H�.%�VZ6�`r&�u��	n�wEV�_��N�V�z�?��8�e�s&,���l����UG�4��Ht���-���Q��r�5T�.�x���A��5����V)��M}
�R��V��1��!���{�-��&U�/���`�}�����9��5d��w�Z�u�|*�?�[�����,���K���Fu�a)��jNI�� /f�����grF����{=^�+P2�+(�[i���a+����"��LL��Z�mi�%�L���I.:���]���G��uT��! W3��K5��]���z��=��ekOf���#U>j�!��ku�n�� 91o��B�Fn���Vy6=?Z�����{h������|����s��!�t��4�x�^\c���4�c���i��x>|���	���G!6��W��|�2���Y��Z�p����[�f9��m�f��E����<KuT��^�z�
�V�-X����T�^�{t�(�y�Q�������<���N�����
���7-���/�dXSf*��+~�h�Nlo�w��S��e'�%?B�	YZ���K�������$�9������k����>��$�e�FZqe�GU��;�����*��c���/��	�ms���������,�7{�ST�z��_/������h�q�6���-��'���=/O�<^l���Xcg�x��r�'��Y=���Z!��]e� <����GG�A�-�������� "�*e�u%W�,�H?w���%��6*��-T(�^�\<�������;��3�~�J�����[������}#������16,y8,d�`�V�
�����D�'�Q3_@���}'\*��`s&-e���.�a��d�V.��p�I&�����)�P��o�
k�!�I+Q�J�?W�Z��AQ��
:�m7Z�Y���1��a��z7d)�FS
����P�n^U[Y��l^_����{"3����,�NvS��0m)��Q%�[�^�#�zH@��*�+#��ky	���c�U�
�j��ok�rln���Rvm8���GA�����c���w%H�6���4+�b���H���&�p�@Y�<��.$�<�x��/R�iUp�"��	�@=r��/�mH�.%��Dp��U%�t^jQTS�(V���!-WN�����L�0Kg095��yQ��^�5��H����h�+�M��>��N��5$���r�m������n��Y�N�]�6�3C�R��5^�R'��F�����lx}����,����YG*�#-����x���H����Ob%�A�5���j�<��2�Z�p�h�2:>�����n��2���Z /��?���V/�dbZ�$�Z���T�djc�6��T�!j`���r�����;H�e����}�v���B,�*a}��j�Au!r�#}����������j�0��b��!��]������e����?c�T����|,Y��8(Dsn��J��4u�^��D���Z��J�����5�yg���J|�����VP�������SB-8��P2�����NqDg�F��x�Gq��+��F���!�NT�Kb��F���B��@�YC�~�|uX�x�U#��D���r�&�]M��X��J��;�dHj��-qK�c	E�OOY��*�������m�u�2�~���g��i�6�DM�����.��m������&m�`��}]�=]����a�����h��C�O�E��TE)8�}^/)O9��C��+�%.	�Y����)��- �%��S����|f(�"�F�$q�{�^�Y�97�>����V�>�������2(�W5�hY�����9-kg7����o�\�����J8�`C=�d
�M7��dp�b��!,�U�_��y��f$�������f1��+�b�T����8l�\e�����ZR���(I\��z9��,;�*�d�</W�0zp��J�7���~@<���X��J��8�G�08���x��@��PBV�����&����bgZ'��V���V=�~��}Jnkl�Jv��/*kJ7��Pr�	P�g_��k;�TE�j�I�Ljlb�V�t���_YsLcP��:���t������{T�.��1(:�7H��{�L��)��:����U�t"Bf���{D}cX�wn���P%#�8��'��`ms�!E������9���@�	-��"���W���E�PToRF�B���4�������^��e���e �{E��P�[���!�E�����	��8�Rg����t��E�������R"���/{�i5.�J�0S�J5q �+<���d�|0O��N�S.#�$9*OU��+~����Z����j�q��ld�~�����3�-vW��,���Z�}��~,��������dg��&�IEq���4���#J+���L�����:������3���M�A�'��MR�������Wu��`���t�?'A�<7���nHe�r��c�[kP������QO��
��m�t��y��f���xv����X�;�$D3!�8�R�j=
*�y�f���N3d���J8��N��b����/o��b0�G�f��X}BP���@m4�Gg9�c����/JU��*o�K���O������<����e�.h�1�8�	7N��w�2"&��PI��k�E�8T����UO*�8�x1f��yG�U)���9��9k1�d)�Z���5������.���+i�?k�7	��g�K�=��)}x����TS��v���$�(��L4�z�[�z���y'�X�������,q8+��o�\�����D44r�w��!��L4�2F5�]��V����u��p�n/V�tof^��b�Z8P|��%�����HRAD�kfKFo�t!EW���:D���LV��"�����;a�|�F���"��r����>/ZX�5��@�1��u���3���QSw�6�R�D�N��������VA�(IJ�Q���{"�(�V�S����M�����yG������z����
O/����"�W?����1t��n$c��B.�e����8��<�_/�/'��DR��2������p3�����Y�t��z]�������}���qd|�V8�����!��\�+"��{�(�|)��tA�JQ�[�		#r�w,�m�GDn|��r������ga��q�C�����E�f/.-;��X���\����$����A,��p���!�$s��q�h"u����j��d#��u�b�S#�5!�������w�C4�>����^�9�Y���
2Ax��������m��K�g��>�Z�y�oP���B���O��=k�����a��m+N�R�<;LET�\�o.��.y�a���B�R9���r��	b
)����]��q0���[�3T�����{�������Jp��S�j�/�E���������a�[?6N����*-�����=����h�Cz�`�������@���5�������>�	�l���l�����f��q�\���P(|���S���ra�����n���\�Z��!����o1Q����<o���[ �D��R����1/���B���m��OVI=��?�
.E���AsD[D�B�1�����}��1I��Wr��U��$K\cs�F��n���D�$��������A�#*������$���(�`�H)Ng�b��b#�)0
�(O�1������B:�O�*��LM��S�o��)���������>=�*���!I*������kg�DR��L+�'Q#a�����8�Nkb����$�r�1;�=go?���1��`��y�$�J��8���E�q9���\4�(11��C�����z�����WKU��PEt�|s6-Qc9����j�$���Bq=
8���P���!
�,5I������P�D���^�w���9j�����>��@tb�E�"�c���������R���&s��?H�N������mt��^���.}���q�22��g�v���y���7t�o<Er:�;y����������a��j��F2�	]��Y���i����^c�3��[����*6�����7w�3v������p��.b�5*i�:&����0����8`�H4���w$5���W�iB�!F��)��""=�D?T����$��[�w����^�XF?|��/�z�g3dM��)�z�Z-����K�K�Q�@_�U��Hw���*Mw�8R(�����M���{n����\w�:�W5q��/�G�~���~������7���m7�6�����.��%��n�r�{U���R3lT��W3�
w�\[�W�\�H?u�U���Ym�������R�*����R��^��zm^�����}p�<��y���T��,f��	���`�Y������$���$��8�r�:>�����Qb��q�+`����^�����F�# v�I���C+N�9MD/��|w��]��Hf�
�*V�����{�L�7|��_��u"]X�nd��e���������;\�0��s�������THu5������C�����Of+O�E��i�?:Y����z��4��B��p�������[��+��G;r39C�'��)+����&���0�g�<G�������d��F~}��|q��9��'f)����L�{u�d�����j)�g�SM�^��AE$Afa�3�79[%Ue�1�F��=b���b>��K}������Y*������i��B������� �.�sTt��o�ia$�m��4n�U����.�`^OG��mI�S����o�w��~ ;�ThK���UE��zr:>�i�Y����*V4��0B�����K5�+E�0��|� e'�i�@��:u2��b���\Wu�O	_C6�M!�i'�'�+����Bz���l��T	��![A8w����R��U%�-5�t�U��$���3S�i�0�H�7�@q�dm���bK�8	w��}D��LT���"$�1<5*��kA�F�}��T��b���
>����
_�������;�%R���F'�|��`��}j�`�t���R�rK������'���@x��~��*��<���%��	����Pm���gHPy�^Pq*@�w�����{5�b����p�������#(H�h7�t%���H>$���U�����"%�2�`xr"c�.��%F69����(Ub��
6����7���z�]�Ae��OFu��%H�K�m�������	M�� ��#������b�Q�*��A����|���_�W��%��|�R-%��������6OdB��w����� )�:oH|��.�l�&��]���~H_�nt���^y~�f��a3�~������h6����x�F�����t���>�O����91�<~W�N�W��
^����R�_Hf���`�8���k�b��EF�	w-D"@���)R%�'����Z�6!&�s�$�m�R��,��N�9�D�Z��\��5��m��n�T�M/��Vd��U����[���(���G=]e"c	G��,�$�]�S*G�X��X�a� �|�q\���F_�`���C�T��=�j�$~�}��F�KfVb�+�����'!J.�d�v&
V*��(PK��D���M���0�s��+?�f����8���z�5%K X������S�2W��z�#i%/��r>�Y�
���.�97��H&��1=v�/&��ec+��D]�rI���*�?>n��	3�`yQ� gd@����$]N���qU���IS����PY�T�?�Z�����	�<����7Yb8���i���/Q�\I�$w�LfG�y]{U��{��:���3�o�-N�\�Z��!"c�SaH���:n�D
�	�'$����C:	��Ur��4���|�Q���kE�'�����5{�M���)$��EQG���i=T����5l�J-�M^���h#W��N�C��	��n�������d�}�����6��fP��b�z�S'T�Uw�_K�`*$�C��?���1'�F����C�"t�H�����:]/y�v�&�)(C�IO�X�>������k�#�l:Y�r�����
:��=�[��I�:�F��d�+��r�3���]���mzPG�!�/}y!��c�8���jEA�;�<[.Q�o}�Y������9�����?��b�����6��bp�j�YP�XYg��=��K]��(�
���4��c*�<B������a�>��y����h#�N�����a[������i�I;3��P�v��$��6KvD�X��_=[���5�4]Y���|{��7<�{������!�7(�����a�
6'��!B���G�|}D������p�����P���!�]�N��ii�D��Q/'������7�m�]c�~�a�ss�����R�\9X�t�p��HS�I��6�gl
����������7��+�z�cHO�^&��!KMx����a�J5����x�� 7����Eq�w>��������<1�w����W#4��o���c�n�&I]���1�f�-I��X7�p��eT��$P�R#�18�m3SD���OswU��c�����y:C��/j9�"�����K3��T�H�|�,l .%�+��g���Y ��*�����,�x�p�0X�&x�UI���4�0�o [&��u�������c��j�	�}�:0d���^e����Y��	��H;��������j���s)�J")4��i������D���������~��>VBb���uj{>�Fz47�~��8��k,
w��h�d��k��_8���Q)J!y��&}S��	A�+��i�[�����[��":*�l(H1�|����{��CE1D���8���e�7��X�X�ep6li�6=2&_��-��h�=�+�"6OV��#����"��l:���>��Q�}n��\��z0��4���9�86�Ou������F�eRgX�R�4=�i8�;N�q���]�p��Dm�D� �f�Lb�bJ!���r�_'I����5�0��8DH��xg����v�#"�����0��:�0�K�g���D5bV�1�t>��>���]+�
��*i��-2RJ��R����S�k~��NAH:�$���m�.���Cf	�{')`�������B�=>k7|�����2_���K"h��Cf�bL���t�&�
��W�'X��\�����a��[����N�4v4������w��c�8pt���6.���"�t����3��$���3gc����%��];����
K{���!��t��9���[��)
��Q.2*q��I=:��
��"��}$���qQ��:��XJ�
���9�L5���\�Vy����t��[��T����<���&�k�A�����h�J��Q��9G�*�*b�3'yF��f�tU���!��4�2������IQ�����X�`FD��<DjDF��mH����K�/~D
��2M5Y��;?�q7���n�OJ+sC�������=��s5����P�K�x]����2�?�C����p�������C��,c�pH�`f�5���St1���3!�R��4]�-�p$"�����V��C!�=��%2�����5��U)]���dF )x$�h�g*�n��7�4_�o��8:��`�����:.�W�2\N���'|�u%b[��������8B:���2���t����O3�OV�bI�p�o��$^�H�cq/�_�W}����)7I��Z���[���dA��TfJ����y������<�������%���4*��%�(q��.�a1�:�Q���W���W��H�����vU��N6q��/ �Kk����&���I��(q�X��^h�|���&������,�Q���V��?��|�p�=w�"����i=A�P�Z��P�~$^T��k=�^��5	��M�NX
�$�19����
�oh�����v��?��?h8��O�^�����2���<8�^�(��p@@�~����=�e�.��~A���ST���t�8Le��c����<�����	��0Z�yE�v�v�h��~��"��w��,��z�� ���E�U���D�$�Lj�{��h�i��7��	����gPv)�$�oA�?�zQ#z�����p:>I���������G�n�*��F��2TM��/m`��?���T�����%�)�8����
�����l��9����������/)34O"GZ�&_�N�������B���:��sz]�"�T_Z5q�rJ�qQ��
����w����U�����+%O�+��m7/����{N�g	|�)�zzf�U ����������xRgK~�X�T^v�(�E�q��N�&���+!��t�z�(��7��l��lw� ���&���@lu9�f�wg�j�G^�O�g�qpT�_�����/N��\�\��;��f�B�����|�8�
fW�O0�x�l��
�w��df�G�5��?}���=��"�����$a�M8K�Qf��EV�����&;=T�k��D}*�R#~k�z�w<�g��d97c�	�n%��s2;~��
�VM�&�(�=0��.;8[X����vHq��boVv�����@�<������
Td������KH����/�������k�����,��~�S�n���PK��]�@��$��_�
�WB�"h�^}����2�� ^�*���I��r���W�U����e����z���G��@4���G�O�
��q�����+�����������=������P2����O��s�#��"e������~�7`�TPE����cB4�PD
�1�h��S��H���g\������_��yX\��I�Y�������mv�����	�����E�������2Z��c��F��T$��_��
	�X�8�v!�L��v�'�y�6(�������_�dW��s��L�}���Q�pD�D��j��K��If�� ��������Tv�,Z�/J��/b{��A�P���no���W�����F�����4r_��|���Dl�����A�7�Ax�=�#��=p���������F"NG�[�����8����y�Q��x�/���(Gk�45�osf��5�P>n
���.&��f-��*6jkcj�{����,z/���2N������������Hd�W�Pc�p��OBv@���A�������N������^3��<�����f�������Wb����ev6��v��/F�6�v�F��W��w	��m~M~;=�k��������u��~�~�
$v����2R�*.��v�oq�o�i���qh�-<��������;�����������&�m���
k�;"m���;��������'��e�w��J��.��oL��W��o
�����M�++��@�W�[)���12;8��u��!����,4|����K�2u~�v��T���,��9n����CP�����%�������#��m|w`w��o������.�s�O�n{S�������z���:���HCwC���Q��"6�����%W��j���4���q�0����
C�����}1_#����S��'�>��������qT�N f{�����;�j�u�����{��������>���%��M���q�k?���f��;srY��HuW����5�Y	�7]�d|��1�������-�k�~&����mp|c�-����� X�|��.���=�����ul�n&k�RW����H�Ql�e���&�N�'}��[����]�MC���My�^�����[�����f�a��J�Y��d�xK���!�����b����x�3�j�/�VFa�E��yh����g������oMA�l��Dfza?���S�d��;-<�/"~yG�;|��oY����O��gij�K>�\�����"g��$o)����`�C!<u���{���`����Y d���������^����D�P�}�G^�������{�C�x^x^�-^�Z!<>��g����?���_U(�'3L6YL����9�{���j}:�e*Z������������6]�;�����/�;���Nn�H�w� k�m�2����VmmJ�����W��FtjEN�[�Me��7�Q��t������V����������k�m�V��J�8W+�Q��mXOh-"�2�� ���=�)���P��Y@����T"�]!��x��.��P��������5��r�N��OG-Z~.M^<�9��vW�2�����5�YJg��'�Za�'���b���%�J���[���m^y���M����=�q[4�����6��c��c>MX������,xS�3/#mTzI���I�C������_����{N��.KNu��O�v���_�kz6����a�4�����8ke'�� s|���O���d� 5��
y�[�9$<�{l����E���Z�B�r�����h_��F6�V��s���M�p��x��q��'3��,���6�,�n\7�n��.&$���n|{|���3�g�����kF8&)�U�q��O&0��=��4l[����a���������[9;���'�u-:�P��q��4�z43�[Fz�����~U�EE|����<m"g�`��K�W�K�����]�����h����;�>��\q���`�V�B��B1� �Qu�
�o������5��<�����E}�Z7������,��g��
��#�{W���P_��^�Y�v{���i�|i��7T;J�O�k;�S�I]O�&Q��@�'b*�. L�t�:<�L����%�U!������=�5����^����H~��&�X3P����5��7o��>��N_$o	bh�/��r�4\�#���W�����9K����9C�n�.K��q6���j�������c�y��5�fY���<3e��
��W�~��<U^��o�B�g����G�e?0����z�+z���7���,������7i�n�#����1�D���*�V�v�C392���/9�f�x���b��Rbv��_�3��T����<G7�+h,e�a�;u6��t����_��uc�4��������:�#�x{G���h�)E#p�kRW�Z���y��T�8l�=�J��P�v@��I�Y������k��1��S�fM� �`���������o#��=e8�)�M����n����
�S���?L5�h��K���8��f���`���0�&����x��^�lV�,�gY�T���N��b����(�J�X�����|Gal�mR��p���7�������L,�[�j?���k�;�.����IY���I;O_	���?���<N�
�:V,g���}6	eU���H'�&�!sj���S�p�ll~x��^z���:�#M���8�r��\��B+�]��W��`����l+���2�>m	l`�����)9�vB���\��o����'�`�>��)�5���j��p�B.�|�a"�~�'DO)J����J!V�K��v)�/��X��~�_��a���8_WA�3j[�>������X��5��;�Gz)&��&�ds1�(�Z�o��8����	,�4��{�
@O~:�]���+�_��g�Yt���]�41!���C����������M�5�
��8�����w~�e�"�u�lN�d��D-hQ��K|�f�T�nf��{�W:b��'�����UgT�B�+r=O��[F�J�ce���@���L����K�|�aX���0b��qe;���8�'��T-�V�X����w&<��Ni��A��.cf�2����1��;I�5`�([���V� w�Pbr:�w�9��&�9�bB�LzK���"�c�Y�q���W���x;>l*(c)��R4��#�������R!��������
��-�9WL�eI|�
�������Woq�
�S	;h:+���	��j���+�����C�;�����
��B,�������?Gp*���D���E��E����sU��K^r��3��wO��sN��n��P��x}<!5��!b�9��9�rsC���
a�N���{��,���Ar�qK��.�����7��N���pg($�s
�a�1e7�g������]G9�" �	��t���F'�&L=�c0-��v�
���Es#mt���)�O)T��|���y��o��LU=�H��NB�RV��}+��1NU�%0rC"��*s?��b�O� ������Y2F}���o3����9y�����h#�5��.�G�3b;

������.Up8�eM��L��������Gh��*%H�4q��
'�^���S����ln��|�:_�`���lEl�iO���jE;��R���Z�����/[��s��*k;�,u�&�]4�}��[�����!�I��W�tdQ��:���z#��		|�,UOf��xg����y������h��i-'�jL�����BZ������)�`����Y���
�/����d2Xq�����=� �e��<�mc~��BK��qf�t,l��59���!�����q7�]"5�L~�����K	'�F���-(0�s��N�3d���2��&C�������Pj��]=�z�M�tt)Z#����8p��0������z��,��KH��B�R�������g�{�:��w�����v?O���R�ql	05��4�����p2�N�s%S�\�/��������`�������3���������)b�����4e��3�Y<mj2�aE7�8�����-w���^�����i���*E�Iu��a�6���'�z�wnG����V�������2��2=I��i���k���D��X�h6@��f�g�����N#���p|k���V(7%p�V�K�"V2�lR[3k�F �O�����$nE�_������]�����[kH"�I����n�%��Of6���L�������<����(c�Zx�l�h�w^�q���2P�����*4{�!W0����ENd���5<���g�H�!~Z/6)�]B��7�q	�n��;��k�2|_3!���|+��"bK3
�*<�A������{<c����.��n\@I�����t2?.�����7>��Y�M��]�K���	����M�-��Er_�@1_
���/�'8���E����u�1���v�cX�/�dH�[U�g��O�6��0�@�J��G2�����������#[���a?�����m�W���f�q�����~@����'%���3�m��;2V�;��cZ�f��/�K6���]�����~��0_��,���@�^��1�������y�n������y���i��H&z�5��u�1*��j��T���dN��N�:��bgyZ��������unG}�8;�j��u$o������������S�<�"��c�+����o�=�����
I���Y:��l����x���n����r5�KX��6<-�m�d-�S9���5�;W�^o�<�� ��vi�3]7������c;����������9��v�P���O~�x�%g����2��e=�e������)��)%T���<~T�w�u�P�X�]�I���sv�D�R�+����|��Z����8��xrG����������a�:�;�����^�s
�b��5�+����#���(]���f)a#-���Qp��wU56���/�Z�=8V�(&��W1z��[���)W3�q���(�t2��P�\e�Xd��"�y��|#5�o���7�M����m�P�����{���Q���#=����:4�xdW�sUN�aP�?mN:��B��}����U]qmc�2Ff�����G/������"����6E33%����
R�L]�e���t���$�~W�xt�=#_+<rr���������,c���G6g��]$�B��������C"�����S�~���^F~QZ��c>���K�I�q��h���R�u�_���1%�2.!`�Y�K�B���S<���-#��tJ������<%��4^����y�9`'�"�%bg�9;:��(�g���+�2^��U���G�FJ�a>���?	�	�No����Z�-��x�8�����9g�R����T���j.6F��*��Z�|`z?�z��_����'u���##"_�V��m��[��Z�6��d�Ev�.p��)gS�q�z���}���/+j&�o����������gb��
����(��8�`f�RJ�Vyi	��N���@e��=����9���U��i1I�$��z�����W���.:��J����9���!|���{��������8��2wL�S�as���m����X.1o�lo"_�����/#}�>��<���;I��oUs�1rs�\6e��><�����n|��lM4^0,���"u��(������#�W��G���9Vz����UKj��������8.��/�\���h��2�B�T��-�i|i��

�-�"��~�O��g�����������=d���T1���H�ep�<TQ����N���mp��jk;@���I�n�_~�9�q�g> V��X���
�jn�����[\��.@� [���ss�������\w���d��O�p�S���d�����n&�L�:�/����IWf��[E��l
:o�)pK��O�D���[v��;�)���]�����l�x�1�:���a��m6�����rR�:*�emM�[�?=�-����Oi�f�L��\m�bb������/�G�|���j�6�����(?�������2GdKe>��l������V9�P�|����\b�\��h��0�������������[4���6��r�#���-����g#���u����Z`B�E��,ap'4^�3��������V���z��Z��o�4�Z���)jiks���9.����_���� �?�������>qY�>S��������b�7�k�g�R��Z�~������	��J|���n$mv�T����}W;�v^��{�����l_�)D,10�^��"g��^\�;�L�c��S��z�a�n7����W 5��o-e`��Xz�nV��k������F�����>��j��e�Y������/��r�~��4�_[(7h�^�X	D,�H��m�n!d��9����K������}AU��1V�J�wJ����������#e�����G��lcv��u��=��O�I.���T;E�C�9[�}H��������A�}}��^���;&�O_�]V�h�k�u���g#�/�3?���5����������$���.h3;��K&�d�|���������W�
���_��Sh�,�b������"�,���aI	�,[�������^��G&�~��!��X���o����`_��i�����z��o��{�Ho���`�fqV�,�Zq����I#�'W8����2������9kj?�JMg�+��_wA}6�*��H����R�M!�����e��[L_9^���6�U�^��V=�y�_7��ga�Cl(}�!��LU������������i��_�P����wb���Hp�
1�w���Jb��������5zssny�������_;i�6t��S����pk�w�;��~�g�$�CG?KR�����n_�D(&�1l_x�'�����;����Fl��s^�|��X��P]��)��i:��1�e�x���sa�S��1���+[l��c���/�����er#�s���W���%�+p����;�������qpc�/H�����p�.�!u��Y�/���=����h���v�����ec��u���O]�;1�j(���D�����T������;�����U�u���Ko=���g��!S�{��X���,��n?��Sh��P�uo�+lyE�1E�-�M��7;�*]V����n���2d���!��8+����,�4�k���:��������|�2����:_�����e�8~���}a�iy��k�}��\�������
m�-�G�_��6��(��Hpw	08�����-�Cp������w�����=��g�}��p>�N�������_YW���Kc��I�Pd
��*��������J2���n���=�6��PO����������K�[�QL�h�I�C��r���T�M���)tZ����/
��������p ��RJH�g�����*�� 3�6������
_��+�����c5dJ���������H/��Q�������Q�*SA�6��_�S	���L)��Il��t��d���f�>�4��D�3����q7��di�5���o��I�R,��bx��f��n��|T�#�;���D�,6A�TV��G0��|��3�E���p��������L�W�|l?��J�\��6����Z�#y7�P�����r�in�-g��vmf����������>B<9� ��O����du_Pq�������y�����\����Fm&n�A�K�\��$�*9��V��R�bB5����jph�h	-��O^Dy�k������g�;c�p�������Q65S�c�^�l\�����rTG�>Z��7��������9���,v'ut)�����b��)'^u��R]#��+G=���_���5�W<�ECW��@����
����b|,5d������3�7��Afx
�g��:E����S�w�w()�&,[���m/��+���||�J�5q���z��\�>"+Kz{+V������/�j�2�O5&n�\���Q����nOypy}��+�F��c��o�T�=Ro��j�(�=�
��D2<}�
(xh�|}�J
���:p���$�[J��
q}���
/7O}��=|�����/�o����e }Z}��o�Nn/��8l��[��|��i�t�L%��U��d]�����-�4��|���2Y�N���a��a��lE����Z���H���[��bf��,�e��1$�"�fk�M���5IN��5���-���K�K�t%���,�
��<�y����H�/�rmF#}�TWm����n���ng_����L���|���\�n�9�'6�r�*�.b��}w��{E6�V[�F'����L%m���3�H�m���3�0}�x��V�1�8e�;����e�V�5��J���������h�.�L�������c�>�Lzr����q��� xe��r���{�Ypn&��y{18.~�d)-r�dl:�?�m=\ac��g��g���J���R�������e�>��^���?�?B��4�W���H�E���R�L��T�����6c3�	
�7S��]��cJ��]���Z�2F��-�!�3f�� O�u*����U�
��T��K1M�0�$��8	~F�h�po��(������_����L���b���=�����?R��Mr�r��S)�?��f��"������i�!�R_���.4
�f�HT�����]Q�6�C�LP�}���=P��&a���J��khAq�!������_�$t�I0$)W�l9i�������=����p�&LQj����2n���jG�G+��X��L#a�Wj:L.�/�GQ�(oy5�qs&��T�}�����fQ1z�R"F ��K������u�����H�7�r�HM��&,����w��]�J�_��*�k�e��Zp.:��E�\P�
��+��0�HR��F�a��y�T�����Y1���Y��k�`��C%u$+�Xue�JE����aw��/a5�9�(���{��?i�U��SH�%���:���:�Y[���Z���������;�'�^|�xl
�a��{|��FA�<v������������o8US�qc��SN0l�}H�yD�]n�x�}����fn�p�����>�
z�(�����~i_l����g�{����[��3�/k�"C{���o���?VJ�}S�E\���<����r��c��.�|�f3��T[�TE~%c������QN����}��Sq��QV�6���Lf������e�[XJ����s���N�^�d�g���PDF�����l ��;|
P+[�1e���Zy�UW.mI����U��$��aQv��y�G����R�K���~�����7��Y��#�_�%��G�Z�!�c
G��g� ��W7�XF%�"�����F������H�U�#
��1�o�3���q�$s��G'��Nkn����l�����U�!��B����&���M��0�FX��H5�
'����R8kI��8��[����6��+��&��fn���4�K������A�K~���|56j�g�hI1�"�Z����G�W��^�eI�>)���"w&6�<<_�n3@�}]����I�����*�#pL�i����������������P�v��G����W8��k+��%�,��Z1
K=�f4���]�
���'�q��WQ|��K�R�B���Xi��4�b�-�-r���P�����j����N,��Qks��  ��k
P*��@��������������,?��n�x�O��k�R�����\�O�@�t/n�/�82,K��������Ml����m���f���r��1#\�>duF��-�5'V�K����4�]9�RF��J1�y5�@<�S^�[F��|����V�%��1T�������{�z�|�����1��?"������99o9w����n	mU���)Z��z��S�&�j�����T�������#��5��(2
o��$e��-��o�J]F����e��*�:9��������0��9	\�x�XS��&�������=�(���������Ds�y����E�1�
��\��#���<��"�%�����$r�]������������x��j���y�HQ[�/U��Z,a*P5M���+g��%�:��a�!�{�K���Bo6�v-����U��m��~e����5��w[�>����/�f�}C}�"m�h��#'�P��Z�ak�a�:��n?��/�����N���Dn���3%���3x���fs�����~\cxGve����$���fw�������'�Y��J�F��E�b������)���1i_@��k��s���(U� ]���Y�7���H����p-)���\k���}�"-��2aV4����f���1�w�;���:�	��U"�FBz����9��+
-��Td�G�+kg���aM�7]qq�/����Z���9�UK�	���H�������0�9E^��*(���M�n����,2�wd�o�c��bsJ���E��5�e�T�8���������5�����Mi�7m���������;Et�H��F�����b�Q�U��n5�J��������j.���t�a��V�,�W�&\��D1���4��2tr�M9�<`C�XR�P����U��n?A������W�^:�`w��.�?SV8nxN0|�Z*�o���l:��~�S[��#�l::�V�����cK.+�9%e�v�o�'��^2m��]SW�k%��/�*("�*��z�(�^�d�&�l/V��N���^�0Ifl]�]<B��l6���Z)�^�a2����R$�N���+=K�x���h��(�C������+��w1������S(�������;��u�'1u��2gx�DM�FA��S�����L���9�92�4���t�E@�zX��F����VN)�S���&���1�@D��7�K��	�m^z^�8�,F�je�u0p��sF���)���~MUA���J��f����#�[������0����-�a��u��Mml��������)�;�0R���R�k��4a���I�7������%0un�������������������v�TCw\V��Z;��!*��SM��h�v�D��Sb�.�#B���v�����s����W-C���X�����a����W�����U�!�&ycW�����
��j�����wV9V�����U�����R��c����S��:��%F���+��G���[Wcm���Ez4�����O�t*��G��_����Hq��u�,m�L�����i�0M|�����]-�#9J,����X����y�Q���dD��~�#���R��4����~=��#��X`�,f�Z�kc���U{�b[�K�����Y���j+�IQ�x��������3����"�<�X�S�.7��#�w*��<)������OwZ1��'�(s6]�H�����)Db�R�hn(
��]t�����>{��_G[�>�.����Wj��q����u�xIx���� �js������`%���2"����m��	��M����(w>!�D�����M��X�H7��D:��{^a���BM��l�A-�>����ox*�V>�~�5j�g��a�����-�v���x\k��TZ�h���R59'�
S��f�r<����a��i
��5cC�[D��:S�)}Q;n��/1?�D��T�[[�c�0�]�,�{�Q%��]��6�e[/V��KLE"��m�����MG���U��me�8�pO�cajF���x�M$��e��y{"����?�������|��"0�&b�'����	��Q�K��Ov�����|8:>{�?n�']<�M��t���!���{����<d����=n{�>���1=}���Za����@��r��x���~G���C��Y!��!i��tq1g���qu�w�X�%��S�V��
��m�)�����p��[y�K
{�;���@��%w%1��O�)������I{>w����+FwJ����f��LJ��4o&�������Js,q���,VnE���R�21�9H�F�Ej��Z�X�����i=N5�e�F���	�������s�`���^�xbEN�g<�f����>�^�h��J1��g"l���B�;	�;K(��0��#b'�+o���uh��4
���g������[`E2��s~��pb1�<��3���������������C������`G+f$�6�h;T-E����8��W���M��������?�p���QC�a�=��H�q�'�]��h\���M�����>��%��vI8��4����p�(]����a��D�u��T*�K$b����_��u��I��c��Or�0s����4k�g������hvc��3�8�`���;Ngga��Q�����]��t]_^Y)66�V�{���1=kG&+G�H��nX/������gzEa������MN_!�rv9m;�3����Cq;�
�������h��e�Q���;RG���������m-?s���sV�u����������O-���w�,��;�+}����-@���$,����SK 0L��|�5���#�|��nr�=QHM�f�a��b���okY$�at��N���{����z*�3f��uh��B&G
�N��3IG�,����JS�
u�|'��Z��?��|T�E{���n�'R��
�C �9�������E�Q.o��D_u�<EE�g���@V����=�Z
[vO'����P���k=������d�q������|��bo�����[�]���\>*����9�"���'��������!�(g����|&��/e�|+������"��dg�"����o��n�Y�g�j�o���}t��o��g��e�z{���ws�e���}2���,pkk)�7CFn�����������+[x��/4����� �+���w$��{�c�&�������W2]��6D�<�g��.���z�	�W}Fs<�fn{����}�h}9%m���/�d�T]����S�HH�<�)�Q��V.<�U2�Mx$���y��u��)sg�
����rp�0fy�F�c�Qp���$�d�p��X�9��B��9�XPD�,k�v��s�c;�SPgT.�}��3qn	d;�����s-������gJE�� ���:�a�SY:y=����f�4�F����J_;��N��@�{(}�wgfq�6���S>��\i�.3���&n{��K����J�T
)E��P�#��������4��m	���c�Y��c1\�R@Nj\���9Kd�C����f�S4�X�������U��`�a97:�C���DO�mE)�Pl��.���,|��EM9����Z]���a�(|�g���.�U�gt�s�:��tM)�TVO�K�c^)5&?����tV+��_���J��c�9���_by����~��S�cw�(�O��$�}�	��5w:�����zqN��&�?���r��f�����e��4:��U\���7�y*G����z�����B���_hS����M�>��������0{��of���j�"�|���_����������dY���� �*�k[��FF�-WF(�6�N���\��e�+?,,����D��r����}sS�jB��M����5������A�L�Lh�>�@���p(M{z�8��n�q{lx�/��|"���6��9�,���%������m(#'���&��[;��B�?�.���}��&���^7���U}��b�]��5��n7�������x��*�2!k'��V}�bA
J�[��zr#�������2
7�^��e�y����H4���`+a6�]}T��m6m��n.�@r<e����W?�����c@�Y�?)]�~E�;�i��$J�M�o�tvU�&_�'Jp�^j������A*�f*��a92�Tq
f���F��C&�l����G-����G7
�/�`y�� ���4�U����4)���q���z�6��?�5����3h<�
('�=P����H�Y���h/�0�Y�[1��.��)��s�����.h^8�l�k�iD���Ex�;"���
��U�P�'I�~�����p������J�f=1,$B��E����
�3�>���}�:>lb]DoC*I��%�;��������gl0�|�#�����t%�
)
��]t%������"F�R{��B�w����|s*�v�T�����B.����A7����!d3��1Hn"?����N��K��5���8�v��nL&a���[f�RIQ{����T��P��:����
�!O�
E��Z�y���r���uj����k0{���q�Qqr��q��s���V~$������G�Z�y�My��y��&�������<#�J���1��=��+���R�7�Tj���C��k����6����,�(����R�89�v�-}��jD+e�T�i~XK���B��$����������XY��Ei�S�(���RQ�p5�{<��P�Q�������T��'�R�j�Z�
��i9iL��:��,0�F�2���Hf
J��Q�Q\RU�_YmKx��g����Jm�gNU�r��~����0?X��^'/��a�����?�]��-6�e�W����?������4v��
:d��c�s�Q>�z�I��Wr5E�P*�9L�c������������-��������"������o���l�2E^���������0����XJfD
����,_������l�3�
���F�{�(qc�D]R�kn��f0!�-�tx"�tXE}�
��O*���+\���7i�����.�j[�b�neKv^aj�B��b;�'\���x�cu�F���?��e�1�p ��m���~j�7:����n_�~w�%��I�T �����3��������\������������r��~���A��:i�|";}a�~�g�t�/o!�����n�6��
G�*t�2T��
jK2��c�w�,�UM#��5X��*���������jj)�flW�(u%mw\_7�t�U|I��Z��QK������~����!>Yjq���K������������7+C��\#�!jRep��@�����AO��0�Mp��;���J���_�F1���B��T�S����\C"h�����@X�u)��9��~|f�f:i�u�Y�0��?
A���J��J{	���J��*$#Mk�����v�jJ9����4���M�
1�*��#,�Me��~���������q���.��~`�
�b�_2��$�C��s<Gs���<|�JF_��Zlqp8e$�9������/r��W*,L�.�Y��dN��A=b����5.�X��"�N���.��Y)BN����!>�������fc���z��R���_��6/���<Ox�u�����X,o#bP��n]s?�w�a_\�H�b����+2�XK&7L���V��o7��]d��UJ'uH�����G}��O3�KT����������9��Y�U:e
�k�Ree����;��Bh�'��sh�6�k��$�M�dlcZM<Gq�'��@����6|�N)�b�.{�b�W'�������}��N�����|t���B��aQ������b�6��
�&��HYL��bZ��EAv|������L�����G>e��&Z�������������N�;���r�_;;�
�_>�q4�����V=��A����^8st/&�	xx����w��6QL������+�R�����B���"��6R�-6��|����b19��&fe��������s*�]"���M��^���K���?�ef�z$I�L�u�1���Ybt�?��%M^.���5��}����O+'�Y0���iO	l���U��u��\AjMA�E6���I=��(�@[P.8���5(J���3Y����O��_��r�����������,I������{~��~�+M,=d�$>=2�h��+��K~�w��V*��B�h��mG=�Q�9����1� ���t�X[E�8.k���J�W��K5T������Y���*'�V�����.��`>H���q�L�QeM��q7��jZ!%�'�zd�=U��+x�u>�(�
��	�1�5��
����S�d�Z�K������584*�m�KS�\������D
���Jg����Z����L��O]=Vk���N�����\s\_b���v��kxC��5���W�F�������E��{��UK�{^=��x���G\*�b�)A(�"��>zi���CGE��ve�h2]��t�H?�~XZ����H5U�#���7(����H�����G��6V�6R�F3\.;��I�.g�W8�����
aH��oim4I����j���"n�h�����;�
-j��v*5�7�<M_�y�������
��R���afP>>���d$_�*x������OiK����MS�A��\,
H8�����![�c ��lF:��7�m&O����%�
��\4���F4���%���s�R�q����{[{�������*Z�^d:�p�������=��EmL����O��&���T�e�G�y���)���N���O��(�myZ�EMm=	-�_O2\N�1/��z�aa[�=0��nA�p�p
@�X��'M��{�p�r��w����#���)����Sp�B]�0v�R���1�}�����R3�\h�[���|����[�y�����Kl`Q�H��_X�������:b�:w��L�����;��L9�j?�tKQA$�����b]D'�B������<���@�c)wHN���
��QF�������|h(������M��5���R��k�}�-1������&�7_��v�5ROkd4]����K������T?��y9�`�s�S����[�~o�O;�u���5P������(����r��A����|�	x�����.1��
j�����A���m�fs��?�����0>W)"�~�i�xz������$�/��5�`:��B���qB������rJ��j�8�"/gs�t?�5�.�����~W�*��7X7-�i���;iq����\����Y3.�z���y�ho�����V�l1$��r+���tP�8�~�����C���L�t�mE��fX��vg���l
Jbv�i�(^�A��
��VW���������D����dH�%��-�����<�h�v���Rq�,�w��q}q�:�)��)�(�����/Y%7���O����]?���r��V�����^x�A��|O�$i��b��3�2�4�a����?Y�gEyID�@Y�B_v���:�?�Z�����}��0�Z�����c<J�����I����T�
�n�������y��PO1&�Q������ZR������,{�Bp�_�
/�HqA�Jc�R��������n�B;o��{��{K]���q�h�5�k����/�7_qH���^�]X�#������UM;SZ����V3g��mk�W��6�:$�,�\9zOvA���k]�4�?��r�|j{�G�-��@&w
�xa�Qa�f�j:��<�W$��ZO��n�������}I8x����F��[�G7����"R[vV���Ug�kRE)�X����vd\
��`,�f���dk	cU�:���Du�_*��K+:;B;�O�� >�b�����P�u�,Z�m��v�b� �t�31�7#�%]�=�Y��	�3<��S���*�k[/�mH�`�x�p��Y���mz�������f�1���7�ulgl�=�X����_�*RH�� �u�*�.k\
��������e*4O�F[�B��lt��E�K2S6G���&�+����=����m$��x����R������E�liE/�m�)X�Pg�@�t�T5^k����v?x�vQ`��v��W�F�S:��x���{�j&��W���C!��/.u_�P�QhN��m��_I�yx��H��{,�������PJ�v1��+���2Tz
~|z���o7�c��(n����(���&��@w02KB�7l���S�R%��KZWr�'�Dd:2����R3dY�eN����u���
gb=�g2���);[����l5����_��l�����UU�S��{d�.��( F��eN��C��3��<��f�`��Ks.SZ���%�NxBLR�lK��������p�/>�,F����?3.8��&�����6����|)����$v�����
S^�9��E�����VX��m	�g��6�f�1`�����=������L�o3�4�[��r�������o@�'���Q�������q��z���'��J@��/��9��4�65��o/t�~[_X�i��Z
��O����)
��W�;;��;�Q�����0|���y��>WC��2u����R���iuw��-d������ap�y}t�3���j�����p�)��L�����Hl�*�V�+���VEf]���-v�9���*F��/������:�UqGS,E��7i
��a�{/��.�_������gA��o�����t��+�B+�����K��������?k5b��2�
���zw�4�EF������)
���
�}�Y�.��r��L��,D��Du���D��2�_Z�����F&zP��=����:=���3����O�U9�^��n@��n��o�E#G�6/oMq6��������U8I9F��C�II��~�G����D��b����Z��
(�Fo|q\��r����Vl��TX�3�qs	���P��"I����qB�(��D%�;��E=�������JW���P���.!�	�Rt�����
���,������f����(C"�z���|W75!�Y����,�L���������8�c��=������v�/�i�Xz��������n�}�Z�~�C���CR9�'�����s������#??�>��k����m���Pj,��y4�:���O����XvFZ���������)KZ���"a�.<�`��u<)��F�gU�6��L}��8������������N|��]���]���K���=�(�}�!J�L��������iV���^��LpW���S����w�L��{A����w}���l�������V:x��w��U��v���;��k����\�W(S��#�r����=�dZ�nx���o��^m��pyI�QWC�c��O�����Rk�U�]'��p~U=�����P$�>�
Ca{��<�i��C�z���2�/�KL6t$3�����D�m�I�ZJf�����������mk�*�6.����pG�
�h�L�G����@�,���&WT����wV1�O�D�C�����J���������"	����IU�D�w�Fk�l'7��,���E�h�$I�bw�s��<ZhNU��oS?j��4��Pp�.l��-ZV�����>�����)��D�|�Mf��iW���#�:���z������$�v�_G��f��0y������P���,2��JJGJ�\��5������x�/�G�$6�c�����a���?���{M�F��O6������6W��R���=��,#c��aBM�B��X���h����am�)�(e�k[�>��]�������_��K����toOa��_����l�6����������3�O���W����k��\7E���l�\=���i��)~5��	����.A���A�F�2���F��d�9�f��8�������gZ���{rQ��W��@�k�"��>�U_$��q�s8��k�j-b���Pb�/��4/��kc�������K��ew���������Gh����~�����b�~O��T�O���>g�-�k7����\��?wO�O�YO.�O���t[8�k�g�O��8�����wO��'�.��<s���+���?�TO��=�O��%�6Xi�Xk~�a���Of7x��0�p8�9���s(���|4�<n�t7C��1�!7�F����*��$Y���U~���_��OP�����C���pw�����"��������
�{�
�c�a���i�u�;!�3��i���}���O���b����OT�b�N��O�ET���N����~.l~�?��r�����';t�{U����i�����!��A����y0����x��������c����~�#�[�r����_o�o\�A��:3�{�x��������-V\g�[x��;�{QoO�=~���Ms4'���{���E�~n�����o���r���K�~�8���g��Z���v�����zX�rX��{:�x��+�B�*�����Z��x}�������g'�@��_���O�x�~���������&�x�c���5l�C���	�n���ci���h��z?|��L���f���Oe�O��x3(�A	�����H8CcOq/|�9�2bX��;{X�hmA�=��?B#��>0�v��}�s���j�=�Om�	�����q��a�c������Nh���/��Z1�����3<\�!D9��>�.��>�5u+��}���sc|����)����T��/=��c5*lk��z�_+�2	�j�0��;x���o�������z���n70m{���o�Xq=����*��?���n���N���.�Q�����rf����Ir�������~y�
/����_U�<V����o�p�p.��
��Q
�4��X��>���}��]�5�vq�i�������a�{��{���'X��E�8����}P��N���� ����z����"Sx�g����t-co�L�[/��E������w�X�������!�}�&,�~`�YZ�_�j���]�'Y�[d ��M28fP\�+&G����G��b��h<�j@�yv�9��)G����+��i�"R��5zG�1"�(����3Rt��"�w��8*��
���2�\��	H����!���kB*?b��FGR��k�	��`5��0�����	,&����T���6�+��6m�^�d��jW�>xs�����~sP��H�`v5��82��]i�k����J�y)F�mA�	�[h���!��������������S����>%t�������������	��\����)���D\��%>�3
���X��z�S��~|%��{LXx�;S^�������%1|����x6t��G�S?y���;���@@\/�i�����h/%	��
Z��\�K���������z�#JNz�3�a�na��6+|�����O���tsTy��!�����1,4���
%�xTX�[`���Mo���Sw��K0�%G�g������4R�A��-��'$A�-�R�b�U����}��T�.A��� O���`Y��k�Z�H�8[pkr}x}�5��;
������q|����l!x��=���3�t�u��Q6,D�/}����(���R�����#Z��!�f.���
������e�(��4$�Xzg>����
�����_M�������iLkN0H����c��chB}��qZ9�!�-�C=����E\U"�����f�����L�*������2�8h���������iD����1#R���1a���_�S�)���z2�&��Y*s���?<	@���4+F�l[���5�7+����e���*�/���a���������vu���X���>����1_����-�V�P;J�����3�|/9l /��d`����"�M�Q���jF*;���h�����V�!��
��
�;��c����n�^h������0�h�YQ����#���;=h�9�,���{sA��$���k�2��@J�pS@dj%+9��`os�Bw1�t��;�����]r�w,�^Q��d0<�I��"��n�&U���%Wf5��L�D��z��d������^�m�oq6������R�`_3�4z4of�O��m1��A�X�#+��L����4'<����MJ%B�����J��������B��#br�'��x��b�u�����4������3�0�u����W�YlD���:%O��/��T�R�8%������
��_/I��U�'�Ir�
�_�dZ����4���T��
�� F8-�������N�y�n���K��a�W�c~��&c��2�k��5�,��������G�������
�YZ�����T���<�P��*�FA[�#��"i�Ea����F'��Q{���yp�������5#I�����;Q�m�ri����r�:��J�|�D�wD�ZDm��������fCV�{%	���.x�/�CT
gt'��9�8>\����P���T�,��������$B�KP�h����ik�m�
wY��r��N�K�xg��~���JN(�*�p���obX-����8����e�0��q�ua8�	�O��6�^�����*�"��x&��n��e��^��J�w�� � �"��mH�\K|s9�A< �I���$�9�a6xM�B�D��Lk�Jn,B�jS�]�v?�n����9�����_@u�Q����2}���J�G^��V�����$�Ou��_q�gq�c����c��.Y#o
���m=��0^��! ����,��ei�#���,�C��W�iQ��
�_2��^���G��;�=P�6���"M���%��q�w���^�J�Vu7.����������~�?-.�����#�*L��zhI`����C�����3@*��d�Q����L_�nG�#g4���� R��
^T,:=�:���I"�,�9�b(��!R�0���MK�
��-���#�n���_
���V��'����&�&���#��@��!n1@}h���K@U
����Bc�mH 
JY�=��W#��eNt��m��Q=Y4���*�������S��Y��*�pe���c���qY���������?�V
���]Gc1���BZf@9Q��Nq��c ��C��8�z},Z�%���d��X������F�����u�A�,{h�������������k���Z	���^;-��@�G`
@e��x
��(����a����� ���&j����B4�hI��~=cYp��Y	}
��7���r��(C��0�����$O��D������A��<�\������"��;
��j��V�����_��_�����@��k������F*�1�j�=�VgQDbE����a��T���A�NANuy�h���H��*�� �&��(o\u��,��P�!$|k�����
��1M��^�&��NK����';{��U��8)���E�B�=���m��W=��,���b��R��(q��M���E�&��=��T���~H_����i`?+J�
������a��wOSI;���E���j���#T}�F�d�'��n���R���@�SB��u���#���9��G�7l���]��[���������H�����."�+3Ab!]y��� �R�?������o���n��_��t�+��Un����@ll+�]]U�.�t�V�:�5?o����7���q��%B\��!m��	��n>�;��n�.s4�4�����b_@)1��PX5�����_]����	��������,e��K���~]^��#�l��&�
I��=Q���/��p��x0�&/���X���[���<,P����FV��s
��!���A�(e�3y�b�\;e`�����t��0�lF���V&�D�q���B%nl��e����!��D����i"��"�0���������O�g�^{P�����<+,r��a�\!��>��x�W((b����#xR����];���0��"�2$���1#T����a�N������t�onRP�t�vK9�b���i�`�G1��B������2$��>%@���d3�1<W ��!�����:�����dVH�k������|i:z���3=�6v�r(�"�wS�t��o���|p��tS��u�0���3������������/i�������F�Ep��������bG`QG��f��!ib2�FBD>��2�f���mK(f�h}�j/�{�z�F��u"����5��'�fx	b*����B���@���t��������w�2�IS>��v=�Q��h�HS{l�n����\v4�#,������?� �0v��G��[�Z!h��+�}�
J"k<w ����G�����������5N�V�W	������
���7T	���MFP��1�b-��,6&z:�7Q�so�sGj>C~7N�,$���e�'����x_ 
/Bc>�P���]a��v�@$]�=d#���{�	Z������E���ORF����-�l	g�L�?��&0���)�P%��JH9p@C�L'X>b��u���6Eaz�2@@(q�aY��a��<+����>�%!!T���U)���������#����%%���D���X=�V0�[@��b�@�����#n?-��H��RFpO�e����wJ�nVs0��M�:��;���5���[9?��-����JjT��+q>e!�z#R�Q�>x�����&�j���U����3]���x����a��t ���S��X�:V���o�����2��b�G�R�����@��%T��+^�Xzj��Sac��+]{QA�N��=v�[9Y����$F���8$����d����|e����oo2�c����<�j����+���	�����
����1�V��]5K\hSb";��=
�%���v��{f#K>S�7z�m����$6*l��`SL9���lr4�� ��#2	�
�J��)�WY7��Wq�������	{a>|��M�u<��f����A��K0�8ay�3�M����b�?�������?)�w��t���t/E`?�e0��6#�� "}�W��/��%����/�P���n�z �H(��ze�H��O�
�����>��-���������F���\�$q��m�p���,X&>���HDC��p:K3[V�q�v��#��
��GWV	�rB����+S�T�u� 6�CK6
a�'�����l��h��?!����

A��Ks6B���mm�x|Z�k:+����l�N�Z�2+Z�W'��k���D� p��'n��z��Cn(��3��9�`P�6�hdg����{Xg����[,�����+v�8N��t����5�B\@���F��U��|��VIp��l��1v#��	�����@�(epx�?(�\���	F����w!�����2L�9���"P���S��>�$
A��,����jk<�l32�;_8�WR��V{�@D_���A�@�Q0cf��3;T�$��t��R�
��c5�z�"����Q�;���rB�
��A���2������>PVM��7��
�<O��de��

Df���6a��*���6��7����&D?DnS��������"(���-U[7U[8��R�`����,�=O0�����
]��'j�H���{��1�_�����[8z
 ��}�,S
�W3��
���J/q�f���-s���m��'����1�L�����pD*����`���#�a�����&>
���1%��,���H��R����T �>��{�<��)�[9�1d���P��l��N<����2�i#��.,R<l�kd��������Q�4M�'2d���I�"���'@X���g�w��xok�^��|�C���FvM���
�-���c�~�rt���i$RS��c��������������[�:M�QO"�m��x�pO-^7!�>|�b|��(<��I|��#(��R��UNN���q��%�Q'���������2�|�"4�dA=J���!m���%�s�:��DW/���'0�C�x��c��B�7��vh�,����!�W'�e��,J#��lC^U��n��b�3��0�����Y�q�6F���:V�k����S.�T���0�����i���mz��]@��%�-���l�;#� ��C�����u��A���jo.P����!�y3c�b���=��R��������"�L��TG�.Wm�c�	�����r�"�����G�j��rF�j<����*�%������^�x]r���-�������X )u�0K@C���W�!:�`K����I/�{�H]�����t���N#|���r���>lpP1��8��X��1��")���F����x���d&`/��wa��&[B����k2"�/�t8�?��~6%�h���"�|�_U�R�����A��Z������0���������\��+=�D��\���}G,����&�(�ni��0u��[D�@��D@-�Id!�������tAH��,��ax)��w|X�%���@�$�f�J�>��=�����4w��A���=��M���4����������#�8��E ����MW�.���T �A���i�S
��N�1�w�@W'��K���D�o��byr�L>b'z�
��p�7
;�E��
��@�����|�J����r���:]�!��*�+���GG�@2�t��:��d�)yvO	E!����������qJ�i��#0��Y������4���;���*�MW*)y�b#��������z����+�]d
�����ro,�������~:\xq��i�}�:��r�U��g���)dmkHK��C��H[4��W������)��*+�}`���s>�*^������'r��1c=>�Z��'����h�2�����A����
�^�t��b� ���;�C1���!`^G�8'�	 ��nBO��M�'�5���l���4�-�`�t>HB��1`|
O��B�f����k��{���r��@���o6$���R��r:0�q��P��s5$�Xp}K��c]$�������6;`�L�'����3fr��_m��[�R���M��rg��o������C�_������H���.�D�Di�����~��~�>eMy�.|��q�n�S��ws��]�.�#-��{��Y�+�/a�x.���S�����,������I���&"Z��BLn�=z]����H���6�nN
�/�s���9�w��0M��������e8�(����n|��;^�1S�l`�UA��x�k�ko5C�}l����y�/A�C���j�WY 0��E�'4Z��r1n�����@H47�(a,�������#o�z������Vs������NeJ%������co e�}b��C�a�	eG)�1�H���pN	��>#(C%�Ts�h*&5�s5������ %�F��
6�*9��;+������/��.�0�,��Y����D�����7������	�_�b$���q�c�Hi������A�������\����_�����K����V
!��_0F3���4��v'a����n����@2�%]�������P�n
5B�?��QPK��>���
�_W!N���U^��R���z[��r�ho&�J���+��O�I����9�I�k��}��!�@V4�u��+tv��y�V����!����2W�8{��!�v_Mr(NfQ��a�:$��V���cy�/p;n9�y����Tm�T|8(�8����m.�����+(�G��C9l�j�����7+I���n2��b\A��u�x"�^\�D<��28��������
C�y�]Ar��h��]�A5��l!I���v�B�
��&�����}����	F&����VB,������:��z4&a���dPX��2�g`M~I~���S���^"5���B1�%�)<�D�	�'~"���%�&
�V�d�d�`m�ZS��!�����d0b�RB-���������/�p%���vA��e���&�ZK��W��F���Y�V�z����	tA�����1:�W��7E1l#+�������G:��|�r���pC�f��x��=^-3G�
��I�=��{x�;�GS� ��^�$F��C?�v�����F�%������X�M��w����0W@��&I&i
�ve����8��Fg�A����w�P��5������F�-m��@@Nw�6�n���_3U�w}�n���_zVX��u��w�Z��	s�[�f����p�4m��mb�\J��X8�s��L"9�l0�2�H�~p�C�pO�o"$@�gD�6%?�4��h��*IUm���2#�
�����5�?a%���?�%������k�^�c��FC�u�I��r��m����~����������|�,�
^��9��y��������������J�p�[B�a��I���-������������������N�\m�K@����%}�(wcT���6����������8�<q(zzz���^A�O�����_�����7�)}�X@�0���a�7�����eoC%�&�o�V*�pL���P��8�Gp��W1v���"W�9����:8�h�pS6���n�L$w
Ah!hih^����~�	�n�
��������;4��~ YedBKnz��DlU��X����
�RS��U�{
�� {�����|����	�>�8�V���v�h�-	09
��	�J���B��f�O����J.��}�B�!R�+r��(�=��\,.�*�2��jB����v��G��^6u���c{�^�E�>��g���5(�|�`�2�X��	��Y�������G%��5��������,!�����6��q��'���5��@3]kQ��M7z,�����R��x^��V�^�1�!���'�����"
�r���2���������e?@��*��!�%?>��(;��gQ�A�������&���,��(��v�:]�~S2.yG��$��b#���E~����?S.\2����Pk�?�	W	�TQ�c��Z�7w�}�u�V 
�c{<`��Ez���D?H
���S7]���l�osS�v]�����j�1�>3Jv���!p���(���
��M*'����p<~e���Iz�
��%?xi�4�g$��sDu�`yS��! ran�W�0
����r�y�ai����ZF���bX��<tK?�"��+�v���W�rP,^U�K5�`��DA2w	=to$��#8.�&��`���^�[d?�%9���*D����M� X�o�����R�1���%fF
>����!������I����c�?��!�_%�c�Ml�����;G���+��g���r�����q��0������m�e!Q��X��BP��>^~�j�0�l`�	�Z�Y�r�����x�U�a]x�ri �A�CV�Q��#����@r'�
C�r`c_E�����l���T@�@�h� �����C��R ������7o���&�Ni����Q���
im��(F�r���:�f]n�z������#�3Kd4�?��~��j>27W�&hQ4h�&�8K����Fn\>d;F{�s���Q���k@b����k�c6a���l+�{���|*�><j�����tD.Z���bW�=Lc-�5|^n���<��������]Y�OpS��d=���92�������{�d��V��V�3�s�-`���n���wm�k�����>��������������:��-[���y|�n�X���q�l���]�wgM���������x�}D��Gk�4����N�����e���^�"[��/����G�^]A�� �����e?4����7������Z����;���5*��w�,���'~�9m�p�T���
�F�%^�|�����Ev�������y- A�����,�����k��o��3vax�������n�����Rr"��7�F��d�"`�LV��@Z�c�_!
jw�DB���G
�����k��~��jC���@X#��K�`�Q��QP������������i�e���*���,������sc�{���S^��_���D�pH�L8�.MoP%h�!���T�-��5�Nx~#�)�>YC�V��k&����hEd3����+��<"9X�q�\M���\�j#��[i	W���q�������EB?>�	��wK����o�ZJ)�~�/�������A��Z���8d�y��� ��;���]�J�z���Q����|����t��F��)�#Iq%�U�k2������R���<N7]����H�c�"�k.�U�����&F�q��+����b+Au6���	r��r�yo=BH�$(6`�N@�f	U �zF����Yv\?�@�=��K��y|�A�@��'�B��0��df����
�,*�t�������S�ZZ.��X�y��v~!S�4��s"��L�VF�C�C��ca�q6*�[�|�oH^��0�Q�gEB	{�9�����C����F�$�_9Y#u�M�,z�V�C ���I�����b�'a�!kY���-J���0$�#���3��������'���!���;k��~�B�����������Vn��Fj��}�q����Y����l'���9g���`G��N���1@�~����MST�M����	�b��&�'����lP�d���i���J�^�B8n������/���
bsC�4!$����HO ��_T�Wi\<5��������uW�M��&�E=y{���>I�r_���c9�#���K.z�Q������z����FZ���Y~Et[|��2�'<D:����C�����D�A@-"�Rm����bL`ep����)0Q4y������Q������z_�J�B��8��/�)�a�=�|�
�3�|C��&��
�b���I.Hk@i#���������iudMC#��{{0Ke

B�7��T�l�@���J}�G<�L����2������Z�*H����#UK����-��)E��^��C��nm[$�}���>�B�GD���������)�	aA-*�*
�,8����;��:��Y�� i�-$�1'p.q�Nx�.Z��n�)���N�$������:v9���M5��x���5����-PA��������������H_D�y��S�a�81��K�d
D�.U��p��Z 1T$������4
��j'Ojo:4L|���F2Q
����F���m�����#���v���56y��*[5�*b���_&�-����<�����}�]�{b��5�����W��G��-F��
��u�0�+���-th#������[��[��y'\��&Q�".�+��K��`����[1�Z���p��*����k5@�(�kP��.����k��
�*@���#D��^),-FVQC6H�6����L��6��q�ry�������]hU+����Pu��o:S���@%�~MV�5V��/�*C�%���
�U+]=t�\>i'���~`��^�aJ+��R�2�
p��n],7I��V������S����drn�|��5���n��{�D���8g�m���5�\��7�x��#����A2��y�B�'��7���x�;�[�W)gS��~�5N"i����`|�}�F�I�������1��� �Ew��7����	��@QEj����wn/�WKy�N�r�^��s-�Un�^�_�e$�s��;�(1G9*=�dd�a]��������^l:�l�v^���uqr|������o1�4'>N��RL��n��7�X��
�L��]x.���m��1��K�����%C��������j^`6q2�bGf�L�,�����ez��_���
F�2XN�W�v��TG#t{/*����/x'�	�gF���q��KH��i��f��p�~�W�����uch6i�S9���"��G]1��y��k���[�^�Ff���@ ����R���,c�hV��E�6v8�#q������[E���������,�������jp�;��t�CQC��l�N]����j�6��4������=������XY�������p�>��������>cTK��^"�S��>�^��9�}��^���j`?au�g��C4���nG>����f�8�x��������-�B9�wO�7����-X@}��#����}#�5�3�t��������5wm����%7r����mWg������U�'������}������y�Vw��no�9�v��m��S�����:�q�_��l_��<\�����Z�k����_�}�G}�3��^(���"�������W/^4�Z(��
F��x��y���/������'�����y���G����N�o#��wKW;c{#+[g��y�����V^�z��ij�@����0�x��oV���:���X�[8�{������Y�`�"�x���_�Z�n����?��U�/Ay����/��m�%�~.��f�v�����z��{;O����������������3���������w+Xt�O+��F.f�'V$�����g!�7+���
����������R��ddbbfk�����D�o����%g#���'�w;t��9B@#{3��;z����
|���������X�Y�9s|�r�3:s������B�'~0�-�)�1;#{+s3gv'S��:��s.|�"|$���|�����g5	6Y%)�������G0���9��&�.f��c 2�?��f���Cx���o������������{���
#56Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#55)
Re: [HACKERS] [PATCH] Incremental sort

On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.

Regarding the benchmarks, I generally used queries of the form

SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b

with the first sort done in various ways:

* regular Sort node
* indexes with Index Scan
* indexes with Index Only Scan

and all these three options with and without LIMIT (the limit was set to
1% of the source table).
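
For concreteness, a minimal sketch of the first two variants (table,
column, and index names are mine and purely illustrative; the real
scripts are in the repository linked below):

-- illustrative table: "a" has ~100 distinct values, "b" breaks ties
CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 100, (i * 997) % 1000003
FROM generate_series(1, 1000000) i;
ANALYZE t;

-- regular Sort node: no useful index, the inner ORDER BY is a full sort
SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a, b;

-- Index Scan: an index on (a) supplies the inner ordering, making the
-- outer ORDER BY a, b a candidate for incremental sort
CREATE INDEX t_a_idx ON t (a);
SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a, b;

The Index Only Scan variant is analogous, with an index covering both
columns.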

I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).

All of this may seem like overkill, but I've found a couple of
regressions thanks to it.

The full scripts and results are available here:

https://github.com/tvondra/incremental-sort-tests

The queries actually executed are a bit more complicated, to eliminate
overhead due to data transfer to client etc. The same approach was used
in the other sorting benchmarks we've done in the past.
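
Judging by the Limit nodes that return zero rows in the plans later in
this thread, the wrapper presumably looks something like this (my
reconstruction of the technique, not a quote from the scripts):

-- run the whole sort server-side while shipping no rows to the client:
-- an OFFSET past the end of the result empties the output
SELECT * FROM (SELECT * FROM t ORDER BY a, b) foo OFFSET 100000000;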

I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.

Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.

In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).

There are a couple of regressions in two cases: sort-indexes and
sort-indexes-ios.

On the small dataset this seems to be related to the number of groups
(essentially, the number of distinct values in a column). My assumption is
that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort). The
slowdown seems to be

On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).

In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.

Thank you very much for testing and benchmarking. I'll investigate
the regressions you found.

Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

I think that depends on the reason for the bad behavior of incremental sort.
For example, our quicksort implementation behaves very well on
presorted data. But incremental sort appears to be not so good in
this case, as Heikki showed upthread. That prompted me to test
presorted datasets (which appeared to be the "worst case") more intensively.
But I suspect that the regressions you found have another reason, and
correspondingly the "worst case" would also be different.
Once I've investigated the reason for the regressions, I'll try to
construct a "worst case" as well.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#57Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#56)
Re: [HACKERS] [PATCH] Incremental sort

On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Thank you very much for testing and benchmarking. I'll investigate
the regressions you found.

Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

I think that depends on the reason for the bad behavior of incremental sort.
For example, our quicksort implementation behaves very well on
presorted data. But incremental sort appears to be not so good in
this case, as Heikki showed upthread. That prompted me to test
presorted datasets (which appeared to be the "worst case") more intensively.
But I suspect that the regressions you found have another reason, and
correspondingly the "worst case" would also be different.
Once I've investigated the reason for the regressions, I'll try to
construct a "worst case" as well.

After some investigation of the benchmark results, I found two sources of
regressions with incremental sort.

*Case 1: Underlying node scan loss is bigger than incremental sort win*

===== 33 [Wed Mar  7 10:14:14 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:on max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual time=5874.527..5874.527 rows=0 loops=1)
   ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939 width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Presorted Key: s_1.a
         Sort Method: external merge  Disk: 29408kB
         Sort Groups: 11
         ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52 rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
 Planning time: 0.269 ms
 Execution time: 5877.367 ms
(9 rows)

===== 37 [Wed Mar  7 10:15:51 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:off max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual time=4741.716..4741.716 rows=0 loops=1)
   ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual time=3522.156..4519.278 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Sort Method: external merge  Disk: 293648kB
         ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939 width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
 Planning time: 0.249 ms
 Execution time: 4777.088 ms
(7 rows)

In this case the optimizer decided that "Index Scan + Incremental Sort"
would be cheaper than "Seq Scan + Sort". But it appears that the amount
of time we lose by selecting Index Scan over Seq Scan is bigger than the
amount of time we win by selecting Incremental Sort over Sort. I would
note that the regular Sort consumes about 10X more disk space. I bet
that all this space fit into the OS cache of the test machine, but the
optimizer expected actual IO to take place. That made the actual time
disagree with the costing.
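
The comparison above is easy to reproduce with the GUC this patch adds
(this is presumably what the incremental:on/off field in the benchmark
headers toggles):

SET max_parallel_workers_per_gather = 0;
SET enable_incrementalsort = on;    -- Index Scan + Incremental Sort
EXPLAIN ANALYZE SELECT * FROM s_1 ORDER BY a, b;
SET enable_incrementalsort = off;   -- Seq Scan + Sort
EXPLAIN ANALYZE SELECT * FROM s_1 ORDER BY a, b;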

*Case 2: Underlying node is not parallelized*

===== 178 [Wed Mar  7 11:18:53 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:on max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual time=4819.999..4819.999 rows=0 loops=1)
   ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20) (actual time=0.203..4603.197 rows=10000000 loops=1)
         Sort Key: s_2.a, s_2.b, s_2.c
         Presorted Key: s_2.a, s_2.b
         Sort Method: quicksort  Memory: 135kB
         Sort Groups: 10201
         ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62 rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
 Planning time: 0.313 ms
 Execution time: 4820.037 ms
(9 rows)

===== 182 [Wed Mar  7 11:20:11 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:off max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual time=3985.818..3985.818 rows=0 loops=1)
   ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20) (actual time=1782.354..3750.868 rows=10000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20) (actual time=1778.362..2091.253 rows=3333333 loops=3)
               Sort Key: s_2.a, s_2.b, s_2.c
               Sort Method: external merge  Disk: 99136kB
               Worker 0:  Sort Method: external merge  Disk: 96984kB
               Worker 1:  Sort Method: external merge  Disk: 97496kB
               ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89 rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
 Planning time: 0.265 ms
 Execution time: 4007.591 ms
(12 rows)

The situation is similar to case #1, except that in the "Seq Scan +
Sort" pair the Sort also gets parallelized. As in the previous case,
disk writes/reads during the external sort are overestimated, because
they actually hit the OS cache. I would also say that this is not
necessarily a wrong decision by the optimizer, because doing this work
in a single backend may consume fewer resources despite being slower
overall.
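
The parallel variant can be reproduced the same way; with incremental
sort disabled the planner is free to pick the parallel plan:

SET max_parallel_workers_per_gather = 2;
SET enable_incrementalsort = off;   -- Parallel Seq Scan + Sort + Gather Merge
EXPLAIN ANALYZE SELECT * FROM s_2 ORDER BY a, b, c;
SET enable_incrementalsort = on;    -- serial Index Scan + Incremental Sort
EXPLAIN ANALYZE SELECT * FROM s_2 ORDER BY a, b, c;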

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#57)
Re: [HACKERS] [PATCH] Incremental sort

On 03/10/2018 06:05 PM, Alexander Korotkov wrote:

On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

...

After some investigation of the benchmark results, I found two sources of
regressions with incremental sort.

*Case 1: Underlying node scan loss is bigger than incremental sort win*

===== 33 [Wed Mar  7 10:14:14 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:on max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual time=5874.527..5874.527 rows=0 loops=1)
   ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939 width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Presorted Key: s_1.a
         Sort Method: external merge  Disk: 29408kB
         Sort Groups: 11
         ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52 rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
 Planning time: 0.269 ms
 Execution time: 5877.367 ms
(9 rows)

===== 37 [Wed Mar  7 10:15:51 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:off max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual time=4741.716..4741.716 rows=0 loops=1)
   ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual time=3522.156..4519.278 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Sort Method: external merge  Disk: 293648kB
         ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939 width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
 Planning time: 0.249 ms
 Execution time: 4777.088 ms
(7 rows)

In this case the optimizer decided that "Index Scan + Incremental
Sort" would be cheaper than "Seq Scan + Sort". But it appears that
the amount of time we lose by selecting Index Scan over Seq Scan is
bigger than the amount of time we win by selecting Incremental Sort
over Sort. I would note that the regular Sort consumes about 10X more
disk space. I bet that all this space fit into the OS cache of the
test machine, but the optimizer expected actual IO to take place.
That made the actual time disagree with the costing.

Yes, you're right, the temporary file(s) likely fit into RAM in this test
(and even if they did not, the storage system is pretty good).

*Case 2: Underlying node is not parallelized*

===== 178 [Wed Mar  7 11:18:53 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:on max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual time=4819.999..4819.999 rows=0 loops=1)
   ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20) (actual time=0.203..4603.197 rows=10000000 loops=1)
         Sort Key: s_2.a, s_2.b, s_2.c
         Presorted Key: s_2.a, s_2.b
         Sort Method: quicksort  Memory: 135kB
         Sort Groups: 10201
         ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62 rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
 Planning time: 0.313 ms
 Execution time: 4820.037 ms
(9 rows)

===== 182 [Wed Mar  7 11:20:11 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:off max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual time=3985.818..3985.818 rows=0 loops=1)
   ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20) (actual time=1782.354..3750.868 rows=10000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20) (actual time=1778.362..2091.253 rows=3333333 loops=3)
               Sort Key: s_2.a, s_2.b, s_2.c
               Sort Method: external merge  Disk: 99136kB
               Worker 0:  Sort Method: external merge  Disk: 96984kB
               Worker 1:  Sort Method: external merge  Disk: 97496kB
               ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89 rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
 Planning time: 0.265 ms
 Execution time: 4007.591 ms
(12 rows)

The situation is similar to case #1, except that in the "Seq Scan +
Sort" pair the Sort also gets parallelized. As in the previous case,
disk writes/reads during the external sort are overestimated, because
they actually hit the OS cache. I would also say that this is not
necessarily a wrong decision by the optimizer, because doing this
work in a single backend may consume fewer resources despite being
slower overall.

Yes, that seems like a likely explanation too.

I agree those don't seem like an issue in the Incremental Sort patch,
but rather like a more generic costing problem.

Thanks for looking into the benchmark results.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#59Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#58)
Re: [HACKERS] [PATCH] Incremental sort

On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

I agree those don't seem like an issue in the Incremental Sort patch,
but rather like a more generic costing problem.

Yes, I think so too.
Do you think we can mark this patch RFC, given that it has already
received a fair amount of review?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#60Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#59)
Re: [HACKERS] [PATCH] Incremental sort

On 03/16/2018 09:47 AM, Alexander Korotkov wrote:

On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I agree those don't seem like an issue in the Incremental Sort patch,
but rather like a more generic costing problem.

Yes, I think so too.

I wonder if we could make the costing a bit more pessimistic, to make
these losses less likely, while still keeping the main wins (particularly
for the LIMIT queries). But that seems a bit like a lost cause, I guess.

Do you think we can mark this patch RFC, given that it has already
received a fair amount of review?

Actually, I was going to propose to switch it to RFC, so I've just done
that. I think the patch is clearly ready for a committer to take a
closer look. I really like this improvement.

I'm going to rerun the tests, but that's mostly because I'm interested
if the change from i++ to i-- in cmpSortPresortedCols makes a measurable
difference. I don't expect to find any issues, so why wait with the RFC?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#61Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#60)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

Revised patch is attached. It's rebased to the last master.

On Fri, Mar 16, 2018 at 3:55 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 03/16/2018 09:47 AM, Alexander Korotkov wrote:

On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I agree those don't seem like an issue in the Incremental Sort patch,
but rather like a more generic costing problem.

Yes, I think so too.

I wonder if we could make the costing a bit more pessimistic, to make
these losses less likely, while still keeping the main wins (particularly
for the LIMIT queries). But that seems a bit like a lost cause, I guess.

Making the costing more pessimistic makes sense. The revised patch does
it in a quite rough way: the estimated group volumes in incremental sort
are multiplied by 1.5. That makes one query in the regression tests fall
back from incremental sort to full sort. Could you test it? If this
shortens the number of cases where incremental sort causes a regression,
it might be an acceptable way to do more pessimistic costing.
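
To make the effect concrete with made-up numbers (mine, not taken from
the patch): a group the planner estimates at 10000 tuples is now costed
as if it held

    1.5 * 10000 = 15000

tuples, so the costed per-group sort volume grows by half, and plans
where incremental sort beats the full sort only by a narrow margin
should now fall back.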

Do you think we can mark this patch RFC, given that it has already
received a fair amount of review?

Actually, I was going to propose to switch it to RFC, so I've just done
that. I think the patch is clearly ready for a committer to take a
closer look. I really like this improvement.

I'm going to rerun the tests, but that's mostly because I'm interested
if the change from i++ to i-- in cmpSortPresortedCols makes a measurable
difference. I don't expect to find any issues, so why wait with the RFC?

Good, thanks.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-18.patchapplication/octet-stream; name=incremental-sort-18.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a2b13846e0..3eab376391 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, essential optimization is top-N
+-- sort.  But it can't be processed at remote side, because we never do LIMIT
+-- push down.  Assuming that sorting is not worth it to push down, CROSS JOIN
+-- is also not pushed down in order to transfer less tuples over network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike previous query, remote side is able to
+-- return tuples in given order without full sort, but using index scan and
+-- incremental sort.  This is much cheaper than full sort on local side, even
+-- despite we don't know LIMIT on remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, essential optimization is top-N
+-- sort.  But it can't be processed at remote side, because we never do LIMIT
+-- push down.  Assuming that sorting is not worth it to push down, CROSS JOIN
+-- is also not pushed down in order to transfer less tuples over network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike previous query, remote side is able to
+-- return tuples in given order without full sort, but using index scan and
+-- incremental sort.  This is much cheaper than full sort on local side, even
+-- despite we don't know LIMIT on remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f18d2b3353..4c2982a627 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3692,6 +3692,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c38d178cd9..02df5dfd59 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1005,6 +1009,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1615,6 +1622,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1942,14 +1955,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			presortedCols;
+
+	if (IsA(plan, IncrementalSort))
+		presortedCols = ((IncrementalSort *) plan)->presortedCols;
+	else
+		presortedCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, presortedCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1960,7 +1996,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1984,7 +2020,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2053,7 +2089,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2110,7 +2146,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2123,13 +2159,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2169,9 +2206,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2379,6 +2420,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..6c597c5b20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -916,6 +925,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -976,6 +986,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1238,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required keys
+ *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can sort
+ *		each group of tuples where the values of (key1, key2 ... keyM) are
+ *		equal individually.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would individually sort by y the
+ *		following groups, each having an equal value of x:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		it brings the greatest benefit to queries with LIMIT, because it can
+ *		return the first tuples without reading the whole input dataset.
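+ *
+ *		For instance, given an index on (x), a query such as
+ *
+ *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+ *
+ *		may use an index scan to provide the ordering by x and an incremental
+ *		sort on top of it to additionally order by y, sorting only as many
+ *		groups of equal x values as the LIMIT requires.  (The query and the
+ *		plan shape here are illustrative only.)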
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presortedKeys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presortedKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether the first "presortedCols" sort values of two tuples are equal.
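+ *
+ * For example, with presortedCols = 1, the tuples (1, 5) and (1, 2) compare
+ * as equal, since only the first attribute is checked: they belong to the
+ * same sort group.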
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	for (i = n - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presortedKeys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presortedKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->grpPivotSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem, we don't copy a pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This may reduce the efficiency
+ * of incremental sort, but it reduces the probability of a performance
+ * regression.
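+ *
+ * For example, if every group contains just a single tuple, the first
+ * MIN_GROUP_SIZE tuples are simply accumulated and sorted together as one
+ * batch rather than as many tiny per-group sorts.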
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being faster,
+ *		it allows the node to start returning tuples before fetching the
+ *		full dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * Otherwise, read the next group of tuples from the outer plan and pass
+	 * them to tuplesort.c, then start fetching tuples from that sorted group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures that cmpSortPresortedCols uses
+		 * to compare the already-sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples, so those groups don't necessarily share an
+		 * equal value of the first column.  We are unlikely to see huge
+		 * groups with incremental sort, so using abbreviated keys would
+		 * likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put the saved tuple, if any, into the tuplesort */
+	if (!TupIsNull(node->grpPivotSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+		ExecClearTuple(node->grpPivotSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Fetch the next group of tuples, whose presortedCols sort values are
+	 * all equal, and put them into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Accumulate at least MIN_GROUP_SIZE tuples unconditionally */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->grpPivotSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while presorted cols are the same as in saved tuple */
+			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->grpPivotSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done by the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->grpPivotSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->presortedKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't retain the full sorted output, so we always
+	 * forget previous sort results: we have to re-read the subplan and
+	 * re-sort from scratch.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3ad4da64aa..df0b49b8c5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4833,6 +4867,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index fd80891954..e59fa0d7a1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..c50365c56a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2080,6 +2081,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2647,6 +2674,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8735e29807..78d5d7e3bf 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3646,6 +3646,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 36b3dfabb8..2bd9968d95 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * A sort can be either a full sort of the relation, or an incremental sort
+ * when we already have the data presorted by some of the required pathkeys.
+ * In the latter case, we estimate the number of groups the presorted
+ * pathkeys divide the input into, and then estimate the cost of sorting
+ * each individual group, assuming the data is divided among the groups
+ * uniformly.  Also, if a LIMIT is specified, then we only have to fetch
+ * and sort some of the groups.
+ *
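+ * As an illustration (the numbers are hypothetical): if 1,000,000 input
+ * tuples fall into 1,000 presorted groups, each group is costed as a sort
+ * of about 1,500 tuples (after the 1.5x pessimism factor applied below).
+ * The sort of the first group is charged to startup cost, while the
+ * remaining groups are charged to run cost scaled by the output fraction,
+ * so a small LIMIT charges little beyond the first group's sort.
+ *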
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1694,13 +1713,56 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey    *key = (PathKey *) lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+
+		/*
+		 * Estimate the average cost of sorting one group of tuples with
+		 * equal presorted keys.  Incremental sort is sensitive to the
+		 * distribution of tuples among the groups, and here we rely on quite
+		 * rough assumptions.  Thus, we're pessimistic about incremental sort
+		 * performance and inflate the average group size by half.
+		 */
+		group_input_bytes = 1.5 * input_bytes / num_groups;
+		group_tuples = 1.5 * tuples / num_groups;
+	}
+	else
+	{
+		num_groups = 1.0;
+		group_input_bytes = input_bytes;
+		group_tuples = tuples;
+	}
+
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1710,7 +1772,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1721,10 +1783,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1794,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add per group cost of fetching tuples from input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can start returning
+	 * tuples.  Sorting the remaining groups is required to return all the
+	 * other tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1831,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to cost the detection of
+		 * sort groups.  This turns out to be one extra copy and comparison
+		 * per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2727,6 +2822,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2753,6 +2850,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2989,18 +3088,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also Nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..869c7c0b16 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,7 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -327,6 +330,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
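+ *
+ *    For example, with keys1 = (a, b, c) and keys2 = (a, b), *n_common is
+ *    set to 2 and false is returned; with keys1 = (a, b) and
+ *    keys2 = (a, b, c), *n_common is likewise 2 but true is returned,
+ *    since keys1 is fully contained in keys2.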
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1628,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query_pathkeys.  The
+ * remaining keys can be satisfied by an incremental sort.
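+ *
+ * For example, if the query requests ORDER BY a, b and the given pathkeys
+ * provide ordering by a only, the result is 1: an incremental sort can
+ * supply the remaining ordering by b.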
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of path keys in common, or 0 if there are none.
+			 * Any leading common pathkeys could be useful for ordering because
+			 * we can use incremental sort.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			/*
+			 * When incremental sort is disabled, pathkeys are useful only when
+			 * they contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1682,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..30b91bd5bc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int presortedCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										  &n_common_pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+									  &n_common_pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int presortedCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9c4a1baf5f..300fbc27e2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4748,13 +4748,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5886,8 +5886,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6123,14 +6124,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6192,21 +6193,24 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, partially_grouped_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			int			n_useful_pathkeys;
 
 			/*
 			 * Insert a Sort node, if required.  But there's no point in
-			 * sorting anything but the cheapest path.
+			 * a non-incremental sort of anything but the cheapest path.
 			 */
-			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
-			{
-				if (path != partially_grouped_rel->cheapest_total_path)
-					continue;
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+			if (n_useful_pathkeys == 0 &&
+				path != partially_grouped_rel->cheapest_total_path)
+				continue;
+
+			if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				path = (Path *) create_sort_path(root,
 												 grouped_rel,
 												 path,
 												 root->group_pathkeys,
 												 -1.0);
-			}
 
 			if (parse->hasAggs)
 				add_path(grouped_rel, (Path *)
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f087369f75..7cc11c4e3a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1021,7 +1021,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fe3b4582d4..aa154b8905 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys;
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										 &n_common_pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	/*
+	 * Use incremental sort when it's enabled and there are common pathkeys;
+	 * otherwise use a regular sort.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf240aa9c5..b694a5828d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3716,6 +3716,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * that the first i pathkeys divide the dataset into.  Effectively a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7a7ac479c1..8862372610 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..26263ab5e6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								   one sort of a group, either in-memory
+								   or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context surviving tuplesort_reset */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +657,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +705,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,110 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* Remember the largest space used, counting on-disk usage as larger than in-memory */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This avoids recreating the tuplesort state (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2589,8 +2710,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2640,7 +2760,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3137,18 +3258,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d9e591802f..b698a9e4ad 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1765,6 +1765,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be presorted
+ *	 by some prefix of those keys.  We call them "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1793,6 +1807,45 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* have we fetched all tuples from the
+								   outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	PresortedKeyData *presortedKeys;	/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *grpPivotSlot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..13d9a75b50 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..5b0c63add9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..597c5052a9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index d768dc0215..ba645562a8 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check handling of a constant-null CHECK constraint
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 84c6e9b5a4..78728f873a 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge        | on
  enable_hashagg            | on
  enable_hashjoin           | on
+ enable_incrementalsort    | on
  enable_indexonlyscan      | on
  enable_indexscan          | on
  enable_material           | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan            | on
  enable_sort               | on
  enable_tidscan            | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 9397f72c13..cde4c2ee5a 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check handling of a constant-null CHECK constraint
#62Darafei "Komяpa" Praliaskouski
me@komzpa.net
In reply to: Alexander Korotkov (#61)
Re: [HACKERS] [PATCH] Incremental sort

Hi,

on a PostGIS system tuned to prefer parallelism heavily
(min_parallel_table_scan_size=10kB) we experience issues with the QGIS table
discovery query with this patch:

Failing query is:
[local] gis@gis=# SELECT l.f_table_name, l.f_table_schema, l.f_geometry_column,
    upper(l.type), l.srid, l.coord_dimension, c.relkind, obj_description(c.oid)
  FROM geometry_columns l, pg_class c, pg_namespace n
  WHERE c.relname = l.f_table_name
    AND l.f_table_schema = n.nspname
    AND n.oid = c.relnamespace
    AND has_schema_privilege(n.nspname, 'usage')
    AND has_table_privilege('"'||n.nspname||'"."'||c.relname||'"', 'select')
    AND l.f_table_schema = 'public'
  ORDER BY n.nspname, c.relname, l.f_geometry_column;

ERROR: XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT: parallel worker
LOCATION: parseNodeString, readfuncs.c:2693
Time: 42,052 ms

Query plan:

                                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=38717.21..38717.22 rows=1 width=393)
   Sort Key: c_1.relname, a.attname
   ->  Nested Loop  (cost=36059.35..38717.20 rows=1 width=393)
         ->  Index Scan using pg_namespace_nspname_index on pg_namespace n  (cost=0.28..2.30 rows=1 width=68)
               Index Cond: (nspname = 'public'::name)
               Filter: has_schema_privilege((nspname)::text, 'usage'::text)
         ->  Nested Loop  (cost=36059.08..38714.59 rows=1 width=407)
               ->  Nested Loop Left Join  (cost=36058.65..38712.12 rows=1 width=334)
                     Join Filter: ((s_2.connamespace = n_1.oid) AND (a.attnum = ANY (s_2.conkey)))
                     ->  Nested Loop Left Join  (cost=36058.51..38711.94 rows=1 width=298)
                           Join Filter: ((s_1.connamespace = n_1.oid) AND (a.attnum = ANY (s_1.conkey)))
                           ->  Nested Loop  (cost=36058.38..38711.75 rows=1 width=252)
                                 Join Filter: (a.atttypid = t.oid)
                                 ->  Gather Merge  (cost=36057.95..38702.65 rows=444 width=256)
                                       Workers Planned: 10
                                       ->  Merge Left Join  (cost=35057.76..37689.01 rows=44 width=256)
                                             Merge Cond: ((n_1.oid = s.connamespace) AND (c_1.oid = s.conrelid))
                                             Join Filter: (a.attnum = ANY (s.conkey))
                                             ->  Incremental Sort  (cost=37687.19..37687.30 rows=44 width=210)
                                                   Sort Key: n_1.oid, c_1.oid
                                                   Presorted Key: n_1.oid
                                                   ->  Nested Loop  (cost=34837.25..37685.99 rows=44 width=210)
                                                         ->  Merge Join  (cost=34836.82..34865.99 rows=9 width=136)
                                                               Merge Cond: (c_1.relnamespace = n_1.oid)
                                                               ->  Sort  (cost=34834.52..34849.05 rows=5814 width=72)
                                                                     Sort Key: c_1.relnamespace
                                                                     ->  Parallel Seq Scan on pg_class c_1  (cost=0.00..34470.99 rows=5814 width=72)
                                                                           Filter: ((relname <> 'raster_columns'::name) AND (NOT pg_is_other_temp_schema(relnamespace)) AND has_table_privilege(oid, 'SELECT'::text) AND (relkind = ANY ('{r,v,m,f,p}'::"char"[])))
                                                               ->  Sort  (cost=2.30..2.31 rows=1 width=68)
                                                                     Sort Key: n_1.oid
                                                                     ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..2.29 rows=1 width=68)
                                                                           Index Cond: (nspname = 'public'::name)
                                                         ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..200.52 rows=11281 width=78)
                                                               Index Cond: (attrelid = c_1.oid)
                                                               Filter: (NOT attisdropped)
                                             ->  Sort  (cost=1.35..1.35 rows=1 width=77)
                                                   Sort Key: s.connamespace, s.conrelid
                                                   ->  Seq Scan on pg_constraint s  (cost=0.00..1.34 rows=1 width=77)
                                                         Filter: (consrc ~~* '%geometrytype(% = %'::text)
                                 ->  Materialize  (cost=0.42..2.45 rows=1 width=4)
                                       ->  Index Scan using pg_type_typname_nsp_index on pg_type t  (cost=0.42..2.44 rows=1 width=4)
                                             Index Cond: (typname = 'geometry'::name)
                           ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_1  (cost=0.14..0.16 rows=1 width=77)
                                 Index Cond: (conrelid = c_1.oid)
                                 Filter: (consrc ~~* '%ndims(% = %'::text)
                     ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_2  (cost=0.14..0.16 rows=1 width=77)
                           Index Cond: (conrelid = c_1.oid)
                           Filter: (consrc ~~* '%srid(% = %'::text)
               ->  Index Scan using pg_class_relname_nsp_index on pg_class c  (cost=0.42..2.46 rows=1 width=73)
                     Index Cond: ((relname = c_1.relname) AND (relnamespace = n.oid))
                     Filter: has_table_privilege((((('"'::text || (n.nspname)::text) || '"."'::text) || (relname)::text) || '"'::text), 'select'::text)
(51 rows)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk

#63Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Darafei "Komяpa" Praliaskouski (#62)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

On Wed, Mar 21, 2018 at 2:30 PM, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:

on a PostGIS system tuned to prefer parallelism heavily
(min_parallel_table_scan_size=10kB) we experience issues with the QGIS table
discovery query with this patch:

ERROR: XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT: parallel worker
LOCATION: parseNodeString, readfuncs.c:2693

[failing query and plan snipped; quoted in full in #62 above]

Thank you for pointing this out. I'll try to reproduce this issue and fix it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#64Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#63)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Mar 21, 2018 at 2:32 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Mar 21, 2018 at 2:30 PM, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:

on a PostGIS system tuned to prefer parallelism heavily
(min_parallel_table_scan_size=10kB) we experience issues with the QGIS table
discovery query with this patch:

ERROR: XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT: parallel worker
LOCATION: parseNodeString, readfuncs.c:2693

[failing query and plan snipped; quoted in full in #62 above]

Thank you for pointing this out. I'll try to reproduce this issue and fix it.

I found that Darafei used a build made from incremental-sort-7.patch. That
version contained a bug in incremental sort node deserialization.
Recent patch versions don't contain that bug.
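
Roughly, deserialization of the new node has to handle the extra
presortedCols field on top of the ordinary Sort fields.  A minimal sketch of
the readfuncs.c side, assuming a ReadCommonSort() helper factored out of
_readSort() (the actual patch may arrange this differently), looks like this:

/* Sketch: shared reader for the Sort fields of both node types */
static void
ReadCommonSort(Sort *local_node)
{
	READ_TEMP_LOCALS();

	ReadCommonPlan(&local_node->plan);

	READ_INT_FIELD(numCols);
	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
	READ_OID_ARRAY(sortOperators, local_node->numCols);
	READ_OID_ARRAY(collations, local_node->numCols);
	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
}

/* Sketch: reader dispatched from parseNodeString() on "INCREMENTALSORT" */
static IncrementalSort *
_readIncrementalSort(void)
{
	READ_LOCALS(IncrementalSort);

	ReadCommonSort(&local_node->sort);

	READ_INT_FIELD(presortedCols);

	READ_DONE();
}

Together with a matching _outIncrementalSort() in outfuncs.c, parallel
workers can then reconstruct the node instead of failing in parseNodeString().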

I've checked that it works.

create table t (i int, value float8);
insert into t select i%1000, random() from generate_series(1,1000000) i;
set force_parallel_mode = on;

# explain select count(*) from (select * from (select * from t order by i) x order by i, value) y;
                                      QUERY PLAN
------------------------------------------------------------------------------------
 Gather  (cost=254804.94..254805.05 rows=1 width=8)
   Workers Planned: 1
   Single Copy: true
   ->  Aggregate  (cost=253804.94..253804.95 rows=1 width=8)
         ->  Incremental Sort  (cost=132245.97..241304.94 rows=1000000 width=12)
               Sort Key: t.i, t.value
               Presorted Key: t.i
               ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
                     Sort Key: t.i
                     ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(10 rows)

# select count(*) from (select * from (select * from t order by i) x order by i, value) y;
  count
---------
 1000000
(1 row)

BTW, the patch had conflicts with master. Please find the rebased version
attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-19.patchapplication/octet-stream; name=incremental-sort-19.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a2b13846e0..3eab376391 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- and an incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- and an incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0d61dcb179..4057c9c920 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3692,6 +3692,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c38d178cd9..02df5dfd59 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1005,6 +1009,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1615,6 +1622,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1942,14 +1955,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			presortedCols;
+
+	if (IsA(plan, IncrementalSort))
+		presortedCols = ((IncrementalSort *) plan)->presortedCols;
+	else
+		presortedCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, presortedCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1960,7 +1996,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1984,7 +2020,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2053,7 +2089,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2110,7 +2146,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2123,13 +2159,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2169,9 +2206,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2379,6 +2420,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
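+			/* keeps only the current sort group, so it can't back up */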
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 14b0b89463..6c597c5b20 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -916,6 +925,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -976,6 +986,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1225,6 +1238,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required list
+ *		of keys.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we sort each
+ *		group of tuples in which the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm sorts the following groups, each having
+ *		an equal value of x, by y individually:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the following
+ *		tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		it delivers by far the greatest benefit for queries with LIMIT,
+ *		because it can return the first tuples without reading the whole
+ *		input dataset.
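+ *
+ *		For example (hypothetical table and index), given an index on (x),
+ *		a query like
+ *
+ *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+ *
+ *		could be executed as an index scan on x feeding an incremental sort
+ *		on (x, y), which typically needs to sort only the first group or two
+ *		of x values before the LIMIT is satisfied.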
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for comparing the presorted key columns.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presortedKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if first "presortedCols" sort values are equal.
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
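+	/*
+	 * Compare the presorted columns in reverse order: the last presorted
+	 * column is presumably the likeliest to differ between two adjacent
+	 * tuples, so checking it first lets us bail out early.
+	 */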
+	for (i = n - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presortedKeys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presortedKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples into node->grpPivotSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem, we don't copy the pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it allows us to start returning tuples before the full
+ *		dataset has been fetched from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the current sorted group, if any.  If the
+	 * group is exhausted but the input isn't, fall through to sort the next
+	 * group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * Read the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c; subsequent calls fetch tuples from the tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for cmpSortPresortedCols,
+		 * i.e. for the already-sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We feed it groups of at least
+		 * MIN_GROUP_SIZE tuples, so these groups don't necessarily have
+		 * equal values in the first column.  With incremental sort we are
+		 * unlikely to have huge groups, so the use of abbreviated keys
+		 * would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* If we have a saved pivot tuple, pass it to the tuplesort first */
+	if (!TupIsNull(node->grpPivotSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+		ExecClearTuple(node->grpPivotSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Feed the tuplesort the next group of tuples, i.e. tuples in which all
+	 * the presortedCols sort values are equal.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Accumulate at least MIN_GROUP_SIZE tuples unconditionally */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save the last tuple of the minimal group as the pivot */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->grpPivotSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Continue the group while presorted cols match the pivot tuple */
+			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->grpPivotSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done by the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->grpPivotSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->presortedKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize the return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make a standalone slot to store the pivot tuple from the outer node */
+	incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop the standalone tuple slot used for the pivot tuple */
+	ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support random access to the sort result,
+	 * so we always forget the previous sort results: the subplan will be
+	 * re-read and re-sorted group by group.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3ad4da64aa..df0b49b8c5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -920,6 +920,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -931,13 +949,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4833,6 +4867,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index fd80891954..e59fa0d7a1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -876,12 +876,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -903,6 +901,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 068db353d7..c50365c56a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2066,12 +2066,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2080,6 +2081,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2647,6 +2674,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8735e29807..78d5d7e3bf 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3646,6 +3646,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 36b3dfabb8..2bd9968d95 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1614,6 +1615,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation, or an incremental sort
+ * when the data is already presorted by some of the required pathkeys.  In the
+ * latter case we estimate the number of groups the presorted pathkeys divide
+ * the input into, and then estimate the cost of sorting each individual group,
+ * assuming tuples are distributed among groups uniformly.  Also, if a LIMIT is
+ * specified, we only have to pull from the input and sort some of the groups.
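+ *
+ * As a rough sketch of the grouped estimate (assuming a uniform distribution
+ * of tuples among groups):
+ *
+ *	group_tuples = 1.5 * tuples / num_groups
+ *	group_input_bytes = 1.5 * input_bytes / num_groups
+ *
+ * where the 1.5 factor is deliberate pessimism about skew; each group is then
+ * costed like an independent sort of group_tuples rows.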
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1640,7 +1648,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1656,19 +1666,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1694,13 +1713,56 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+
+		/*
+		 * Estimate the average cost of sorting one group of tuples whose
+		 * presorted keys are equal.  Incremental sort is sensitive to the
+		 * distribution of tuples across groups, and we rely here on quite
+		 * rough assumptions.  We are therefore pessimistic about incremental
+		 * sort performance and inflate the average group size by half.
+		 */
+		group_input_bytes = 1.5 * input_bytes / num_groups;
+		group_tuples = 1.5 * tuples / num_groups;
+	}
+	else
+	{
+		num_groups = 1.0;
+		group_input_bytes = input_bytes;
+		group_tuples = tuples;
+	}
+
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1710,7 +1772,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1721,10 +1783,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1732,14 +1794,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add the per-group cost of fetching tuples from the input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can return any output.
+	 * Sorting the rest of the groups is needed to return the remaining tuples.
+	 */
+	startup_cost += group_cost;
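+
+	/*
+	 * About num_groups * (output_tuples / tuples) groups must be fetched and
+	 * sorted (all of them when there is no LIMIT); the first one is already
+	 * charged to startup_cost above, hence the "- 1.0" below.
+	 */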
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1750,6 +1831,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to cost the detection of
+		 * sort groups.  This turns out to be one extra copy and comparison
+		 * per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2727,6 +2822,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2753,6 +2850,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2989,18 +3088,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also, nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..869c7c0b16 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,7 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -327,6 +330,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
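+ *
+ *    For example (hypothetical keys), with keys1 = (a, b, c) and
+ *    keys2 = (a, b, d) the result is 2: the shared (a, b) prefix is
+ *    exactly what an incremental sort can exploit.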
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1628,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query_pathkeys.  The
+ * remaining keys can be satisfied by an incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of pathkeys in common, or 0 if there are
+			 * none.  Any leading common pathkeys could be useful for
+			 * ordering because we can use an incremental sort.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			/*
+			 * When incremental sort is disabled, pathkeys are useful only
+			 * when they contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1682,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ae1bf31d5..30b91bd5bc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int presortedCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										  &n_common_pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+									  &n_common_pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,7 +1681,13 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
-	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL);
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
+	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
+								   NULL, n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1914,7 +1931,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3862,10 +3880,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3876,10 +3899,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4934,8 +4962,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5526,13 +5559,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5865,9 +5916,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5887,7 +5940,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5930,7 +5983,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5951,7 +6004,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int presortedCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5984,7 +6038,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6649,6 +6703,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 85805ff5c7..71d1d3d7ef 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4746,13 +4746,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5884,8 +5884,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6120,14 +6121,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6190,12 +6191,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				int			n_useful_pathkeys;
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
-				 * sorting anything but the cheapest path.
+				 * a non-incremental sort of anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (n_useful_pathkeys == 0 &&
+					path != partially_grouped_rel->cheapest_total_path)
+					continue;
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4617d12cb9..be520e6086 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f087369f75..7cc11c4e3a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1021,7 +1021,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 22133fcf12..acd15da0a4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys;
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										 &n_common_pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	/*
+	 * Use incremental sort when it's enabled and there are common pathkeys;
+	 * otherwise use a regular sort.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf240aa9c5..b694a5828d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3716,6 +3716,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset is
+ * 							  divided into by each prefix of the pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  This is effectively a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7a7ac479c1..8862372610 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -859,6 +859,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..26263ab5e6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +657,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +705,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,110 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* Update the maximum, treating on-disk usage as larger than any in-memory usage. */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in place.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating the tuplesort (and thereby
+ *	saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2589,8 +2710,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2640,7 +2760,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3137,18 +3258,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d9e591802f..b698a9e4ad 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1765,6 +1765,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be presorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1793,6 +1807,45 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* have we fetched all tuples from the
+								   outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	PresortedKeyData *presortedKeys;	/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *grpPivotSlot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 74b094a9c3..133bb17bdc 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -73,6 +73,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -125,6 +126,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -240,6 +242,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19eae68..13d9a75b50 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -751,6 +751,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d576aa7350..5b0c63add9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1519,6 +1519,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 132e35551b..00f0205be4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -105,8 +106,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 94f9bb2b57..597c5052a9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index d768dc0215..ba645562a8 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check handling of a constant-null CHECK constraint
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 84c6e9b5a4..78728f873a 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 759f7d9d59..f855214099 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge        | on
  enable_hashagg            | on
  enable_hashjoin           | on
+ enable_incrementalsort    | on
  enable_indexonlyscan      | on
  enable_indexscan          | on
  enable_material           | on
@@ -87,7 +88,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan            | on
  enable_sort               | on
  enable_tidscan            | on
-(15 rows)
+(16 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 9397f72c13..cde4c2ee5a 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check handling of a constant-null CHECK constraint
#65Teodor Sigaev
teodor@sigaev.ru
In reply to: Alexander Korotkov (#64)
Re: [HACKERS] [PATCH] Incremental sort

BTW, patch had conflicts with master.  Please, find rebased version attached.

Sorry, but patch conflicts with master again.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#66Teodor Sigaev
teodor@sigaev.ru
In reply to: Alexander Korotkov (#64)
Re: [HACKERS] [PATCH] Incremental sort

BTW, patch had conflicts with master.  Please, find rebased version attached.

Despite the conflict with master, the patch looks committable. Does anybody
object to committing it?

The patch has received several rounds of review over 2 years, and it seems to
me that keeping it out of the sources risks losing it, even though it offers
a performance improvement in rather wide use cases.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#67Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Teodor Sigaev (#66)
Re: [HACKERS] [PATCH] Incremental sort

On 03/28/2018 03:28 PM, Teodor Sigaev wrote:

BTW, patch had conflicts with master.  Please, find rebased version
attached.

Despite the conflict with master, the patch looks committable. Does anybody
object to committing it?

The patch has received several rounds of review over 2 years, and it seems to
me that keeping it out of the sources risks losing it, even though it offers
a performance improvement in rather wide use cases.

No objections from me - if you want me to do one final round of review
after the rebase (not sure how invasive it'll turn out), let me know.

BTW one detail I'd change is the name of the GUC variable. enable_incsort
seems unnecessarily terse - let's go for enable_incremental_sort or
something like that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#68Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#67)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Mar 28, 2018 at 4:44 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 03/28/2018 03:28 PM, Teodor Sigaev wrote:

BTW, patch had conflicts with master. Please, find rebased version
attached.

Despite the conflict with master, the patch looks committable. Does anybody
object to committing it?

The patch has received several rounds of review over 2 years, and it seems to
me that keeping it out of the sources risks losing it, even though it offers
a performance improvement in rather wide use cases.

No objections from me - if you want me to do one final round of review
after the rebase (not sure how invasive it'll turn out), let me know.

Rebased patch is attached.  Incremental sort gets used in multiple places of
the partition_aggregate regression test.  I've checked those cases, and it
seems that incremental sort was selected correctly.
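
As a quick check, the new node reports its presorted prefix in EXPLAIN output
with a "Presorted Key" line; for example, the updated drop-index-concurrently
isolation test now expects:

Incremental Sort
  Sort Key: id, data
  Presorted Key: id
  ->  Index Scan using test_dc_pkey on test_dc
        Filter: ((data)::text = '34'::text)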

BTW one detail I'd change is the name of the GUC variable. enable_incsort
seems unnecessarily terse - let's go for enable_incremental_sort or
something like that.

enable_incsort was already renamed to enable_incrementalsort in [1].
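
For reference, the new regression test in inherit.sql exercises the renamed
GUC like this:

set enable_incrementalsort = on;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
UNION ALL
SELECT thousand, thousand FROM tenk1
ORDER BY thousand, tenthous;

The expected plan is a Merge Append whose second child is an Incremental Sort
with "Presorted Key: tenk1_1.thousand".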

1.
/messages/by-id/CAPpHfduAVmiGDZC+dfNL1rEGu0mt45Rd_mxwjY57uqwWhrvQzg@mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-20.patchapplication/octet-stream; name=incremental-sort-20.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2d6e387d63..d11777cb90 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be done on the remote side, because we never
+-- push LIMIT down.  Since the sort is not worth pushing down, the CROSS JOIN
+-- is also not pushed down, in order to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the given order without a full sort, using an index scan
+-- plus an incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be done on the remote side, because we never
+-- push LIMIT down.  Since the sort is not worth pushing down, the CROSS JOIN
+-- is also not pushed down, in order to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the given order without a full sort, using an index scan
+-- plus an incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4d899e3b24..ea463f3105 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3692,6 +3692,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c38d178cd9..02df5dfd59 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1005,6 +1009,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1615,6 +1622,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -1942,14 +1955,37 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			presortedCols;
+
+	if (IsA(plan, IncrementalSort))
+		presortedCols = ((IncrementalSort *) plan)->presortedCols;
+	else
+		presortedCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, presortedCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -1960,7 +1996,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -1984,7 +2020,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2053,7 +2089,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2110,7 +2146,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2123,13 +2159,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2169,9 +2206,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2379,6 +2420,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
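+			/* keeps only the current sort group; backward scan unsupported */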
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required keys
+ *		list.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we sort each
+ *		group of tuples where the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would sort the following groups, which
+ *		have equal x, by y individually:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		its biggest benefit is for queries with LIMIT, because incremental
+ *		sort can return the first tuples without reading the whole input
+ *		dataset.
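+ *
+ *		For illustration, a plan using this node could look like the
+ *		following (hypothetical plan shape, assuming an index on x):
+ *
+ *			Limit
+ *			  ->  Incremental Sort
+ *			        Sort Key: x, y
+ *			        Presorted Key: x
+ *			        ->  Index Scan using tbl_x_idx on tbl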
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presortedKeys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presortedKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether the values of the first "presortedCols" sort columns are
+ * equal in the two given tuple slots.
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	for (i = n - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presortedKeys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presortedKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples to node->grpPivotSlot introduces some overhead.  It's
+ * especially noticeable when groups contain only one or a few tuples.  To
+ * cope with this problem we don't copy the pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of a regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming the outer subtree returns tuples presorted by some prefix
+ *		of the target sort columns, performs an incremental sort.  It
+ *		fetches groups of tuples where the prefix sort columns are equal and
+ *		sorts them using tuplesort.  This approach avoids sorting the whole
+ *		dataset.  Besides taking less memory and being faster, it allows us
+ *		to start returning tuples before fetching the full dataset from the
+ *		outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the current sorted group, if any.  If the
+	 * group is exhausted and the input is finished, return the cleared slot;
+	 * otherwise fall through and sort the next group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * Read the next group of tuples from the outer plan and pass them to
+	 * tuplesort.c, then return tuples from the sorted group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for cmpSortPresortedCols, which
+		 * compares the already sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+		 * necessarily have equal values of the presorted columns.  We are
+		 * unlikely to have huge groups with incremental sort, so using
+		 * abbreviated keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put saved tuple to tuplesort if any */
+	if (!TupIsNull(node->grpPivotSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+		ExecClearTuple(node->grpPivotSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, where the presortedCols sort values are
+	 * equal, into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Always put the first MIN_GROUP_SIZE tuples into the tuplesort */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save the last tuple of the minimal group as the pivot */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->grpPivotSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Continue while the presorted cols match the saved pivot tuple */
+			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->grpPivotSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with the number of tuples we've actually sorted, so
+	 * that the next group's bounded sort keeps only the tuples still needed.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->grpPivotSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->presortedKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop the standalone tuple slot used for the pivot tuple */
+	ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support rescanning the sorted output, because
+	 * only the current group is kept in the tuplesortstate.  So we always
+	 * forget the previous sort results, re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c7293a60d7..b93a7a1d43 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -921,6 +921,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4834,6 +4868,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f61ae03ac5..9d9c90e2be 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -877,12 +877,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -904,6 +902,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3756,6 +3772,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index fd4586e73d..338bf8b835 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2067,12 +2067,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2081,6 +2082,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2648,6 +2675,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 43f4e75748..c28aa4affb 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3655,6 +3655,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..e8cfdd81fd 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1615,6 +1616,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation, or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the source data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided among groups uniformly.
+ * Also, if a LIMIT is specified, then we only have to pull from the source
+ * and sort some of the groups.
+ *
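+ * Informally (see the code below), the incremental sort estimate amounts to
+ * num_groups sorts of about (1.5 * tuples / num_groups) tuples each, where
+ * only the first group's sort is charged to startup cost and the remaining
+ * groups are charged to run cost.
+ *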
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1641,7 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys already presorted in the given path
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1657,19 +1667,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1695,13 +1714,56 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the dataset is divided into by the
+	 * presorted keys.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+
+		/*
+		 * Estimate the average cost of sorting one group where the presorted
+		 * keys are equal.  Incremental sort is sensitive to the distribution
+		 * of tuples among the groups, where we rely on quite rough
+		 * assumptions.  Thus, we are pessimistic about incremental sort
+		 * performance and increase its average group size by half.
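+		 *
+		 * For example (illustrative numbers only): with 1,000,000 input
+		 * tuples and an estimated 1,000 groups, each group is costed as if
+		 * it contained 1,500 tuples rather than 1,000.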
+		 */
+		group_input_bytes = 1.5 * input_bytes / num_groups;
+		group_tuples = 1.5 * tuples / num_groups;
+	}
+	else
+	{
+		num_groups = 1.0;
+		group_input_bytes = input_bytes;
+		group_tuples = tuples;
+	}
+
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1711,7 +1773,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1722,10 +1784,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1733,14 +1795,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, then assume the logarithmic
+		 * part of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add the per-group cost of fetching tuples from the input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can start returning
+	 * tuples.  Sorting the rest of the groups is required to return all the
+	 * other tuples.
+	 */
+	startup_cost += group_cost;
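+
+	/*
+	 * With a LIMIT, only about num_groups * (output_tuples / tuples) groups
+	 * need to be fetched and sorted; without a LIMIT, output_tuples equals
+	 * tuples, so rest_cost simply covers the remaining (num_groups - 1)
+	 * groups.
+	 */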
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1751,6 +1832,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In the incremental sort case we also have to cost the detection of
+		 * sort groups.  This turns out to be one extra copy and comparison
+		 * per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of the per-group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2728,6 +2823,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2754,6 +2851,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2990,18 +3089,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also Nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..869c7c0b16 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -308,6 +310,7 @@ compare_pathkeys(List *keys1, List *keys2)
 	return PATHKEYS_EQUAL;
 }
 
+
 /*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
@@ -327,6 +330,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
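+ *
+ *    For example (illustrative): for keys1 = (a, b, c) and keys2 = (a, b, d)
+ *    this sets *n_common to 2 and returns false, while for keys1 = (a, b)
+ *    and keys2 = (a, b, c) it sets *n_common to 2 and returns true.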
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1628,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query_pathkeys.  The
+ * others can be satisfied by an incremental sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of pathkeys in common, or 0 if there are
+			 * none.  Any leading common pathkeys could be useful for
+			 * ordering, because we can use an incremental sort.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			 * When incremental sort is disabled, pathkeys are useful only
+			 * when they contain all the query pathkeys.
+			 * they do contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1682,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8b4f031d96..e047e7736b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int presortedCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										  &n_common_pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+									  &n_common_pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,6 +1681,11 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
 	/*
 	 * make_sort_from_pathkeys() indirectly calls find_ec_member_for_tle(),
 	 * which will ignore any child EC members that don't belong to the given
@@ -1678,7 +1694,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	 */
 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
 								   IS_OTHER_REL(best_path->subpath->parent) ?
-								   best_path->path.parent->relids : NULL);
+								   best_path->path.parent->relids : NULL,
+								   n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1922,7 +1939,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3870,10 +3888,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3884,10 +3907,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4942,8 +4970,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5534,13 +5567,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5873,9 +5924,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5895,7 +5948,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5938,7 +5991,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5959,7 +6012,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int presortedCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5992,7 +6046,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6657,6 +6711,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 52c21e6870..f116743734 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4825,13 +4825,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5963,8 +5963,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6202,14 +6203,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6272,12 +6273,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				int			n_useful_pathkeys;
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
-				 * sorting anything but the cheapest path.
+				 * non-incremental sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (n_useful_pathkeys == 0 &&
+					path != partially_grouped_rel->cheapest_total_path)
+					continue;
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 69dd327f0c..08a9545634 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 6e510f9d94..2062237c0a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1110,7 +1110,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0, 
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 22133fcf12..acd15da0a4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -105,7 +105,7 @@ compare_path_costs(Path *path1, Path *path2, CostSelector criterion)
 }
 
 /*
- * compare_path_fractional_costs
+ * compare_fractional_path_costs
  *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
  *	  or more expensive than path2 for fetching the specified fraction
  *	  of the total tuples.
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys;
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										 &n_common_pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	/*
+	 * Use incremental sort when it's enabled and there are common pathkeys,
+	 * use regular sort otherwise.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 50b34fcbc6..0b5ce4be45 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf240aa9c5..b694a5828d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3716,6 +3716,42 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	return numdistinct;
 }
 
+/*
+ * estimate_pathkeys_groups	- Estimate the number of groups the dataset
+ * 							  is divided into by pathkeys.
+ *
+ * Returns an array of group counts: the i'th element is the number of groups
+ * the first i pathkeys divide the dataset into.  This is effectively a
+ * convenience wrapper over estimate_num_groups().
+ */
+double *
+estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+{
+	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *result;
+	int			i;
+
+	/*
+	 * Get number of groups for each prefix of pathkeys.
+	 */
+	i = 0;
+	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							linitial(key->pk_eclass->ec_members);
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+		i++;
+	}
+
+	return result;
+}
+
 /*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d075cb139a..511528a0f3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -860,6 +860,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 041bdc2fa7..26263ab5e6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,13 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sorts
+								   of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is value for on-disk
+								   space, false when it's value for in-memory
+								   space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +657,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state, bool delete);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +705,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state, bool delete)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,110 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	if (delete)
+	{
+		MemoryContextDelete(state->maincontext);
+	}
+	else
+	{
+		MemoryContextResetOnly(state->sortcontext);
+		MemoryContextResetOnly(state->tuplecontext);
+	}
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state, true);
+}
+
+/*
+ * tuplesort_updatemax 
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* XXX */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Release all the data in the tuplesort, but keep the
+ *	meta-information.  After tuplesort_reset, the tuplesort is ready to start
+ *	a new sort.  This avoids recreating the tuplesort (and saves resources)
+ *	when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state, false);
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2589,8 +2710,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2640,7 +2760,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3137,18 +3258,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6070a42b6f..bf379f7f20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1807,6 +1807,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset could already be
+ *	 presorted by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1835,6 +1849,45 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* is fetching tuples from the outer
+								   node finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	PresortedKeyData *presortedKeys;	/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *grpPivotSlot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 443de22704..4bc270ed6f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -126,6 +127,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -241,6 +243,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index c922216b7d..974112d086 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -753,6 +753,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index abbbda9e91..86db5098e4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1523,6 +1523,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..60edbd996f 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -106,8 +107,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..26787a6221 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 299c9f846a..43e8ef20dc 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -206,6 +206,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+										double tuples);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f56151fc1e..f643422d5b 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check handling of a constant-null CHECK constraint
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 84c6e9b5a4..78728f873a 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..39c17c6f03 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -176,9 +176,11 @@ EXPLAIN (COSTS OFF)
 SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
                               QUERY PLAN                               
 -----------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.c, (sum(pagg_tab_p1.a)), (avg(pagg_tab_p1.b))
-   ->  Append
+   Presorted Key: pagg_tab_p1.c
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.c
          ->  GroupAggregate
                Group Key: pagg_tab_p1.c
                Filter: (avg(pagg_tab_p1.d) < '15'::numeric)
@@ -197,7 +199,7 @@ SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 O
                ->  Sort
                      Sort Key: pagg_tab_p3.c
                      ->  Seq Scan on pagg_tab_p3
-(21 rows)
+(23 rows)
 
 SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
   c   | sum  |         avg         | count 
@@ -215,8 +217,9 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
                               QUERY PLAN                               
 -----------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b)), (avg(pagg_tab_p1.b))
+   Presorted Key: pagg_tab_p1.a
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_p1.a
          Filter: (avg(pagg_tab_p1.d) < '15'::numeric)
@@ -237,7 +240,7 @@ SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 O
                      ->  Sort
                            Sort Key: pagg_tab_p3.a
                            ->  Seq Scan on pagg_tab_p3
-(22 rows)
+(23 rows)
 
 SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
  a  | sum  |         avg         | count 
@@ -356,9 +359,11 @@ EXPLAIN (COSTS OFF)
 SELECT c, sum(b order by a) FROM pagg_tab GROUP BY c ORDER BY 1, 2;
                                QUERY PLAN                               
 ------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.c, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
-   ->  Append
+   Presorted Key: pagg_tab_p1.c
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.c
          ->  GroupAggregate
                Group Key: pagg_tab_p1.c
                ->  Sort
@@ -374,7 +379,7 @@ SELECT c, sum(b order by a) FROM pagg_tab GROUP BY c ORDER BY 1, 2;
                ->  Sort
                      Sort Key: pagg_tab_p3.c
                      ->  Seq Scan on pagg_tab_p3
-(18 rows)
+(20 rows)
 
 -- Since GROUP BY clause does not match with PARTITION KEY; we need to do
 -- partial aggregation. However, ORDERED SET are not partial safe and thus
@@ -383,8 +388,9 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
                                QUERY PLAN                               
 ------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
+   Presorted Key: pagg_tab_p1.a
    ->  GroupAggregate
          Group Key: pagg_tab_p1.a
          ->  Sort
@@ -393,7 +399,7 @@ SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
                      ->  Seq Scan on pagg_tab_p1
                      ->  Seq Scan on pagg_tab_p2
                      ->  Seq Scan on pagg_tab_p3
-(10 rows)
+(11 rows)
 
 -- JOIN query
 CREATE TABLE pagg_tab1(x int, y int) PARTITION BY RANGE(x);
@@ -487,8 +493,9 @@ EXPLAIN (COSTS OFF)
 SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2.y GROUP BY t1.y HAVING avg(t1.x) > 10 ORDER BY 1, 2, 3;
                                QUERY PLAN                                
 -------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: t1.y, (sum(t1.x)), (count(*))
+   Presorted Key: t1.y
    ->  Finalize GroupAggregate
          Group Key: t1.y
          Filter: (avg(t1.x) > '10'::numeric)
@@ -521,7 +528,7 @@ SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2
                                  ->  Seq Scan on pagg_tab2_p3 t2_2
                                  ->  Hash
                                        ->  Seq Scan on pagg_tab1_p3 t1_2
-(34 rows)
+(35 rows)
 
 SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2.y GROUP BY t1.y HAVING avg(t1.x) > 10 ORDER BY 1, 2, 3;
  y  | sum  | count 
@@ -1068,8 +1075,9 @@ EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                             QUERY PLAN                             
 -------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
+   Presorted Key: pagg_tab_ml_p1.b
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_ml_p1.b
          ->  Sort
@@ -1090,7 +1098,7 @@ SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                      ->  Partial HashAggregate
                            Group Key: pagg_tab_ml_p3_s2.b
                            ->  Seq Scan on pagg_tab_ml_p3_s2
-(22 rows)
+(23 rows)
 
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
  b |  sum  | count 
@@ -1159,9 +1167,11 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
                                     QUERY PLAN                                    
 ----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-   ->  Append
+   Presorted Key: pagg_tab_ml_p1.a
+   ->  Merge Append
+         Sort Key: pagg_tab_ml_p1.a
          ->  Finalize GroupAggregate
                Group Key: pagg_tab_ml_p1.a
                Filter: (avg(pagg_tab_ml_p1.b) < '3'::numeric)
@@ -1200,7 +1210,7 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER B
                                  ->  Partial HashAggregate
                                        Group Key: pagg_tab_ml_p3_s2.a
                                        ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(41 rows)
+(43 rows)
 
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
  a  | sum  | count 
@@ -1222,8 +1232,9 @@ EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                                  QUERY PLAN                                 
 ----------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
+   Presorted Key: pagg_tab_ml_p1.b
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_ml_p1.b
          ->  Gather Merge
@@ -1246,7 +1257,7 @@ SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_ml_p3_s2.b
                                  ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(24 rows)
+(25 rows)
 
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
  b |  sum  | count 
@@ -1327,8 +1338,9 @@ EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
                                       QUERY PLAN                                      
 --------------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
+   Presorted Key: pagg_tab_para_p1.x
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_para_p1.x
          Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
@@ -1346,7 +1358,7 @@ SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) <
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_para_p3.x
                                  ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
+(20 rows)
 
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
  x  | sum  |        avg         | count 
@@ -1364,8 +1376,9 @@ EXPLAIN (COSTS OFF)
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
                                       QUERY PLAN                                      
 --------------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_para_p1.y, (sum(pagg_tab_para_p1.x)), (avg(pagg_tab_para_p1.x))
+   Presorted Key: pagg_tab_para_p1.y
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_para_p1.y
          Filter: (avg(pagg_tab_para_p1.x) < '12'::numeric)
@@ -1383,7 +1396,7 @@ SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) <
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_para_p3.y
                                  ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
+(20 rows)
 
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
  y  |  sum  |         avg         | count 
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 9397f72c13..cde4c2ee5a 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check handling of a constant-null CHECK constraint
#69Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#68)
Re: [HACKERS] [PATCH] Incremental sort

On 03/28/2018 05:12 PM, Alexander Korotkov wrote:

On Wed, Mar 28, 2018 at 4:44 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 03/28/2018 03:28 PM, Teodor Sigaev wrote:

BTW, patch had conflicts with master. Please find the rebased version
attached.

Despite the patch conflicts, the patch looks committable; does anybody
have objections to committing it?

The patch has received several rounds of review over two years, and it
seems to me that keeping it out of the sources may cause it to be lost,
although it offers performance improvements in rather wide use cases.

No objections from me - if you want me to do one final round of review
after the rebase (not sure how invasive it'll turn out), let me know.

Rebased patch is attached.  Incremental sort gets used in multiple places
in the partition_aggregate regression test.  I've checked those cases, and
it seems that incremental sort was selected correctly.

OK, I'll take a look.

BTW, one detail I'd change is the name of the GUC variable. enable_incsort
seems unnecessarily terse - let's go for enable_incremental_sort or
something like that.

enable_incsort was already renamed to enable_incrementalsort
in [1].

1.
/messages/by-id/CAPpHfduAVmiGDZC+dfNL1rEGu0mt45Rd_mxwjY57uqwWhrvQzg@mail.gmail.com

Ah, apologies. I've been looking at the wrong version.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#70Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Teodor Sigaev (#66)
Re: [HACKERS] [PATCH] Incremental sort

Teodor Sigaev wrote:

BTW, patch had conflicts with master. Please find the rebased version attached.

Despite the patch conflicts, the patch looks committable; does anybody have
objections to committing it?

The patch has received several rounds of review over two years, and it seems
to me that keeping it out of the sources may cause it to be lost, although it
offers performance improvements in rather wide use cases.

Can we have a recap on what the patch *does*? I see there's a
description in Alexander's first email
/messages/by-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
but that was a long time ago, and the patch has likely changed in the
meantime ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#71Andres Freund
andres@anarazel.de
In reply to: Teodor Sigaev (#66)
Re: [HACKERS] [PATCH] Incremental sort

Hi,

On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:

BTW, patch had conflicts with master. Please find the rebased version attached.

Despite the patch conflicts, the patch looks committable; does anybody have
objections to committing it?

The patch has received several rounds of review over two years, and it seems
to me that keeping it out of the sources may cause it to be lost, although it
offers performance improvements in rather wide use cases.

My impression is that it has *NOT* received enough review to be RFC. I'm not
saying it's impossible to get there this release, but just committing it
doesn't seem wise.

Greetings,

Andres Freund

#72Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#71)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Mar 28, 2018 at 7:17 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:

BTW, patch had conflicts with master.  Please, find rebased version
attached.

Despite the patch conflict, the patch looks committable; does anybody
object to committing it?

The patch has received several rounds of review over two years, and it
seems to me that keeping it out of the sources may cause us to lose it,
although it offers a performance improvement in rather wide use cases.

My impression is that it has *NOT* received enough review to be RFC. I'm
not saying it's impossible to get there this release, but just committing
it doesn't seem wise.

I would say that the executor part of this patch has already received
plenty of review.  For sure, there might still be issues; I just mean that
the amount of review the executor part has received is not less than for
the average patch we commit.  But the optimizer and costing parts of this
patch still need somebody to take a look at them.
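
To give a feeling for what the costing part does (rough illustrative
numbers of mine, not from any real run): with presorted keys, cost_sort()
estimates num_groups via estimate_num_groups(), pessimistically inflates
the per-group size by a factor of 1.5, charges the sorting of one group
to startup cost, and spreads the remaining groups over run cost.  So for
1M input tuples split into 1000 groups, startup pays roughly
1500 * log2(1500), about 16 thousand comparisons, instead of the
1M * log2(1M), about 20 million comparisons, that a full sort would pay
before returning the first tuple.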

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#73Alexander Kuzmenkov
a.kuzmenkov@postgrespro.ru
In reply to: Alexander Korotkov (#72)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi Alexander,

I took a quick look at the patch. Some things I fixed myself in the
attached patch v.21. Here is the summary:

Typo in compare_fractional_path_costs() should be fixed as a separate patch.
Remove unused function estimate_pathkeys_groups.
Extra MemoryContextReset before tuplesort_end() shouldn't be a big deal,
so we don't have to add a parameter to tuplesort_free().
Add comment to maincontext declaration.
Fix typo in INITIAL_MEMTUPSIZE.
Remove trailing whitespace.

Some other things I found:

In tuplesort_reset:
if (state->memtupsize < INITIAL_MEMTUPSIZE)
    <reallocate memtuples to INITIAL_MEMTUPSIZE>
I'd add a comment explaining when and why we have to do this, as sketched
below. Also maybe add a comment to the other allocations of memtuples in
tuplesort_begin() and mergeruns(), explaining why it is reallocated and
why in maincontext.
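
Something like this, perhaps (only a sketch of the comment I have in
mind; the exact reason should be double-checked against the code):

if (state->memtupsize < INITIAL_MEMTUPSIZE)
{
    /*
     * The memtuples array may have been replaced by a smaller one during
     * the previous sort (e.g. in mergeruns()), so grow it back to
     * INITIAL_MEMTUPSIZE before starting the next group.  It is allocated
     * in maincontext so that it survives resets of the sort context.
     */
    <reallocate memtuples to INITIAL_MEMTUPSIZE>
}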

In tuplesort_updatemax:
    /* XXX */
    if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
        (spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
The /* XXX */ placeholder needs to be filled in with a real comment. Also,
comparing bools with '>' looks confusing to me.
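
For example, something along these lines might read better (a sketch
using the patch's field names; the updates inside the branch are my guess
at the intent):

/*
 * Track the maximum space used across sort groups.  An on-disk sort
 * always ranks above an in-memory one; within the same category the
 * space actually used decides.
 */
if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
    (spaceUsedOnDisk == state->maxSpaceOnDisk &&
     spaceUsed > state->maxSpace))
{
    state->maxSpaceOnDisk = spaceUsedOnDisk;
    state->maxSpace = spaceUsed;
}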

We should add a comment on top of tuplesort.c, explaining that we now
have a faster way to sort multiple batches of data using the same sort
conditions.

The name 'main context' sounds somewhat vague. Maybe 'top context'? Not
sure.

In ExecSupportsBackwardScan:
        case T_IncrementalSort:
            return false;
This separate case looks useless; I'd either add a comment explaining
why it can't scan backwards, or just return false by default.
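
E.g. something like this (a sketch that borrows the reasoning from the
EXEC_FLAG_BACKWARD comment in ExecInitIncrementalSort):

case T_IncrementalSort:
    /*
     * Incremental sort keeps only the current sort group in its
     * tuplesort, so already-returned tuples are gone and a backward
     * scan is not possible.
     */
    return false;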

That's all I have for today; tomorrow I'll continue with reviewing the
planner part of the patch.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-21.patchtext/x-patch; name=incremental-sort-21.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2d6e387d63..d11777cb90 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization
+-- is a top-N sort.  But it can't be done at the remote side, because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down either, the
+-- CROSS JOIN is also not pushed down, so fewer tuples cross the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, by using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4d2e43c9f0..729086ee29 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization
+-- is a top-N sort.  But it can't be done at the remote side, because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down either, the
+-- CROSS JOIN is also not pushed down, so fewer tuples cross the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, by using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e7d408824e..f2da888056 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3692,6 +3692,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a58672a94..edd71ae133 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1064,6 +1068,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1674,6 +1681,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2001,15 +2014,38 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
+	int			presortedCols;
+
+	if (IsA(plan, IncrementalSort))
+		presortedCols = ((IncrementalSort *) plan)->presortedCols;
+	else
+		presortedCols = 0;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, presortedCols, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
 /*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
+/*
  * Likewise, for a MergeAppend node.
  */
 static void
@@ -2019,7 +2055,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2043,7 +2079,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2112,7 +2148,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2169,7 +2205,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2182,13 +2218,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2228,9 +2265,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2439,6 +2480,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 }
 
 /*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->groupsCount);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->groupsCount, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		groupsCount;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			groupsCount = incrsortstate->shared_info->sinfo[n].groupsCount;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, groupsCount);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, groupsCount, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
+/*
  * Show information on hash buckets/batches.
  */
 static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..34e05330ea 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,12 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 43a27a9af2..17163448a3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1f5e41f95a
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,631 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is a specially optimized kind of multikey sort used
+ *		when the input is already presorted by a prefix of the required list
+ *		of keys.  Thus, when we need to sort by (key1, key2 ... keyN) and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we only sort
+ *		the groups in which the values of (key1, key2 ... keyM) are equal.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (x, y), already presorted by x, while we need to sort
+ *		them by both x and y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 10)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm sorts the following groups, each of
+ *		which has equal x, by y individually:
+ *			(1, 5) (1, 2)
+ *			(2, 10) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we get the
+ *		following tuple set, which is sorted by both x and y.
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 10)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort is faster than a full sort on large datasets.  But
+ *		the biggest benefit of incremental sort is for queries with LIMIT,
+ *		because incremental sort can return the first tuples without reading
+ *		the whole input dataset.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presortedKeys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presortedKeys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presortedKeys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check if first "presortedCols" sort values are equal.
+ */
+static bool
+cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
+															TupleTableSlot *b)
+{
+	int n, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	for (i = n - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presortedKeys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(a, attno, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presortedKeys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Copying tuples into node->grpPivotSlot introduces some overhead.  It's
+ * especially notable when groups contain only one or a few tuples.  In order
+ * to cope with this problem we don't copy the pivot tuple until the group
+ * contains at least MIN_GROUP_SIZE tuples.  This might reduce the efficiency
+ * of incremental sort, but it reduces the probability of regression.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.  It
+ *		fetches groups of tuples where the prefix sort columns are equal and
+ *		sorts them using tuplesort.  This approach avoids sorting the whole
+ *		dataset at once.  Besides taking less memory and being faster, it
+ *		allows us to start returning tuples before fetching the full dataset
+ *		from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * If first time through, read all tuples from outer plan and pass them to
+	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "calling tuplesort_begin");
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for cmpSortPresortedCols, which
+		 * compares the already-sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to the tuplesort, so these groups don't
+		 * necessarily have equal values of the first column.  We are
+		 * unlikely to have huge groups with incremental sort, so using
+		 * abbreviated keys would likely be a waste of time.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->groupsCount++;
+
+	/* Calculate remaining bound for bounded sort */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* Put saved tuple to tuplesort if any */
+	if (!TupIsNull(node->grpPivotSlot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
+		ExecClearTuple(node->grpPivotSlot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, whose presortedCols sort values are
+	 * equal, into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/* Put next group of presorted data to the tuplesort */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Save last tuple in minimal group */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->grpPivotSlot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/* Iterate while presorted cols are the same as in saved tuple */
+			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->grpPivotSlot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].groupsCount =
+															node->groupsCount;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current sort group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->grpPivotSlot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->groupsCount = 0;
+	incrsortstate->presortedKeys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->grpPivotSlot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop the standalone tuple slot from the outer node */
+	ExecDropSingleTupleTableSlot(node->grpPivotSlot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Also must re-sort if the
+	 * bounded-sort parameters changed or we didn't select randomAccess.
+	 *
+	 * Otherwise we can just rewind and rescan the sorted output.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c7293a60d7..b93a7a1d43 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -922,6 +922,24 @@ _copyMaterial(const Material *from)
 
 
 /*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
+/*
  * _copySort
  */
 static Sort *
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4834,6 +4868,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f61ae03ac5..9d9c90e2be 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -877,12 +877,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -905,6 +903,24 @@ _outSort(StringInfo str, const Sort *node)
 }
 
 static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
+static void
 _outUnique(StringInfo str, const Unique *node)
 {
 	int			i;
@@ -3756,6 +3772,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index fd4586e73d..338bf8b835 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2067,12 +2067,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2081,6 +2082,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2648,6 +2675,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 43f4e75748..c28aa4affb 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3655,6 +3655,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..e8cfdd81fd 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1615,6 +1616,13 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
+ * The sort can be either a full sort of the relation or an incremental sort
+ * when we already have data presorted by some of the required pathkeys.  In
+ * the latter case we estimate the number of groups the source data is divided
+ * into by the presorted pathkeys, and then estimate the cost of sorting each
+ * individual group, assuming the data is divided into groups uniformly.  Also,
+ * if a LIMIT is specified then we only have to pull and sort some of the groups.
+ *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1641,7 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * work that has to be done to prepare the inputs to the comparison operators.
  *
  * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
+ * 'presorted_keys' is the number of pathkeys by which the input is already sorted
+ * 'input_startup_cost' is the startup cost for reading the input data
+ * 'input_total_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
@@ -1657,19 +1667,28 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+	if (!enable_incrementalsort)
+		presorted_keys = 0;
 
 	path->rows = tuples;
 
@@ -1695,13 +1714,56 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate the number of groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List	   *presortedExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
+
+		/* Extract presorted keys as list of expressions */
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+										linitial(key->pk_eclass->ec_members);
+
+			presortedExprs = lappend(presortedExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		/* Estimate number of groups with equal presorted keys */
+		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
+
+		/*
+		 * Estimate the average cost of sorting one group in which the
+		 * presorted keys are equal.  Incremental sort is sensitive to the
+		 * distribution of tuples across groups, for which we rely on quite
+		 * rough assumptions.  Thus, we're pessimistic about incremental sort
+		 * performance and inflate the assumed average group size by half.
+		 */
+		group_input_bytes = 1.5 * input_bytes / num_groups;
+		group_tuples = 1.5 * tuples / num_groups;
+	}
+	else
+	{
+		num_groups = 1.0;
+		group_input_bytes = input_bytes;
+		group_tuples = tuples;
+	}
+
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = input_bytes / sort_mem_bytes;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = group_input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1711,7 +1773,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1722,10 +1784,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1733,14 +1795,33 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
-		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		/*
+		 * We'll use plain quicksort on all the input tuples.  If we expect
+		 * fewer than two tuples per sort group, assume the logarithmic part
+		 * of the estimate to be 1.
+		 */
+		if (group_tuples >= 2.0)
+			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
+		else
+			group_cost = comparison_cost * group_tuples;
 	}
 
+	/* Add per group cost of fetching tuples from input */
+	group_cost += input_run_cost / num_groups;
+
+	/*
+	 * We have to sort the first group before the node can return any tuples.
+	 * Sorting the remaining groups is needed to return all the other tuples.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+
 	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
@@ -1751,6 +1832,20 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
+	/* Extra costs of incremental sort */
+	if (presorted_keys > 0)
+	{
+		/*
+		 * In incremental sort case we also have to cost the detection of
+		 * sort groups.  This turns out to be one extra copy and comparison
+		 * per tuple.
+		 */
+		run_cost += (cpu_tuple_cost + comparison_cost) * tuples;
+
+		/* Cost of per group tuplesort reset */
+		run_cost += 2.0 * cpu_tuple_cost * num_groups;
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -2728,6 +2823,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2754,6 +2851,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
@@ -2990,18 +3089,17 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * inner path is to be used directly (without sorting) and it doesn't
 	 * support mark/restore.
 	 *
-	 * Since the inner side must be ordered, and only Sorts and IndexScans can
-	 * create order to begin with, and they both support mark/restore, you
-	 * might think there's no problem --- but you'd be wrong.  Nestloop and
-	 * merge joins can *preserve* the order of their inputs, so they can be
-	 * selected as the input of a mergejoin, and they don't support
-	 * mark/restore at present.
+	 * Sorts and IndexScans support mark/restore, but IncrementalSorts don't.
+	 * Also, Nestloop and merge joins can *preserve* the order of their inputs,
+	 * so they can be selected as the input of a mergejoin, and they don't
+	 * support mark/restore at present.
 	 *
 	 * We don't test the value of enable_material here, because
 	 * materialization is required for correctness in this case, and turning
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
-	else if (innersortkeys == NIL &&
+	else if ((innersortkeys == NIL ||
+			  pathkeys_common(innersortkeys, inner_path->pathkeys) > 0) &&
 			 !ExecSupportsMarkRestore(inner_path))
 		path->materialize_inner = true;
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..57fe52dc98 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,10 +22,12 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -327,6 +329,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
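+ *    For example (hypothetical keys): keys1 = (a, b, c) and keys2 =
+ *    (a, b, d) set *n_common to 2 and return false.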
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1627,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of pathkeys that match the given query_pathkeys.  The
+ * remaining keys can be satisfied by an incremental sort.
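+ *
+ * For example (an illustrative case): with query_pathkeys = (a, b) and
+ * pathkeys = (a), this returns 1 when enable_incrementalsort is on and 0
+ * otherwise.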
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of path keys in common, or 0 if there are none.
+			 * Any leading common pathkeys could be useful for ordering because
+			 * we can use an incremental sort.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			/*
+			 * When incremental sort is disabled, pathkeys are useful only
+			 * when they contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1681,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8b4f031d96..e047e7736b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -236,7 +236,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -252,10 +252,11 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree);
+						 Plan *lefttree,
+						 int presortedCols);
 static Material *make_material(Plan *lefttree);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
@@ -443,6 +444,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1128,6 +1130,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		/* Must insist that all children return the same tlist */
@@ -1162,9 +1165,11 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										  &n_common_pathkeys))
 		{
 			Sort	   *sort = make_sort(subplan, numsortkeys,
+										 n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1514,6 +1519,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	Plan	   *subplan;
 	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	int			n_common_pathkeys;
 
 	/* As with Gather, it's best to project away columns in the workers. */
 	subplan = create_plan_recurse(root, best_path->subpath, CP_EXACT_TLIST);
@@ -1543,12 +1549,16 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
+									  &n_common_pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 n_common_pathkeys,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1661,6 +1671,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1670,6 +1681,11 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
 	/*
 	 * make_sort_from_pathkeys() indirectly calls find_ec_member_for_tle(),
 	 * which will ignore any child EC members that don't belong to the given
@@ -1678,7 +1694,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	 */
 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
 								   IS_OTHER_REL(best_path->subpath->parent) ?
-								   best_path->path.parent->relids : NULL);
+								   best_path->path.parent->relids : NULL,
+								   n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -1922,7 +1939,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan = (Plan *)
 					make_sort_from_groupcols(rollup->groupClause,
 											 new_grpColIdx,
-											 subplan);
+											 subplan,
+											 0);
 			}
 
 			if (!rollup->is_hashed)
@@ -3870,10 +3888,15 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	if (best_path->outersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		outer_relids = outer_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
-												   best_path->outersortkeys,
-												   outer_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
+									   outer_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3884,10 +3907,15 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	if (best_path->innersortkeys)
 	{
+		Sort	   *sort;
+		int			n_common_pathkeys;
 		Relids		inner_relids = inner_path->parent->relids;
-		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
-												   best_path->innersortkeys,
-												   inner_relids);
+
+		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys);
+
+		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
+									   inner_relids, n_common_pathkeys);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -4942,8 +4970,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
+
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
 
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5534,13 +5567,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use a regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5873,9 +5924,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5895,7 +5948,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5938,7 +5991,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5959,7 +6012,8 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 static Sort *
 make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 int presortedCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -5992,7 +6046,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6657,6 +6711,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..308f60beac 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a19f5d0c02..ce5b1cc76e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4828,13 +4828,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5966,8 +5966,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6205,14 +6206,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6275,12 +6276,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				int			n_useful_pathkeys;
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
-				 * sorting anything but the cheapest path.
+				 * doing a non-incremental sort of anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (n_useful_pathkeys == 0 &&
+					path != partially_grouped_rel->cheapest_total_path)
+					continue;
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 69dd327f0c..08a9545634 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 6e510f9d94..9fca84fde1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1110,7 +1110,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 22133fcf12..2202e97ee4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1362,12 +1362,14 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys;
 
 		pathnode->path.rows += subpath->rows;
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
+										 &n_common_pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -1381,6 +1383,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1632,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1721,6 +1726,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	GatherMergePath *pathnode = makeNode(GatherMergePath);
 	Cost		input_startup_cost = 0;
 	Cost		input_total_cost = 0;
+	int			n_common_pathkeys;
 
 	Assert(subpath->parallel_safe);
 	Assert(pathkeys);
@@ -1737,7 +1743,7 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.pathtarget = target ? target : rel->reltarget;
 	pathnode->path.rows += subpath->rows;
 
-	if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+	if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys))
 	{
 		/* Subpath is adequately ordered, we won't need to sort it */
 		input_startup_cost += subpath->startup_cost;
@@ -1751,6 +1757,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  n_common_pathkeys,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2618,35 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+
+	/*
+	 * Use incremental sort when it's enabled and there are common pathkeys;
+	 * otherwise use a regular sort.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
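+	/*
+	 * A sort path should only be built when the subpath isn't already
+	 * sorted on all the requested pathkeys, so the common prefix must be a
+	 * strict prefix (our reading of the invariant asserted below).
+	 */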
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,7 +2660,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
+	cost_sort(&pathnode->path, root,
+			  pathkeys, n_common_pathkeys,
+			  subpath->startup_cost,
 			  subpath->total_cost,
 			  subpath->rows,
 			  subpath->pathtarget->width,
@@ -2938,7 +2974,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index ed36851fdd..a6e14af9b8 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4ffc8451ca..c8aa384a74 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -861,6 +861,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
 			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..83665e0fb2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for
+								   on-disk space, false when it's the value
+								   for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +658,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +696,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,14 +706,22 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data that is worth keeping while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
 	 * A dedicated child context used exclusively for caller passed tuples
@@ -715,7 +738,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +763,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +772,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +829,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +880,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +913,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1008,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1087,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1130,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1247,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1313,104 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * On-disk usage is treated as dominating in-memory usage: a sort spills
+	 * to disk only once the data no longer fit in memory, so a disk-based
+	 * measurement supersedes an in-memory one; within the same class, keep
+	 * the larger value.
+	 */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	to save resources) when sorting multiple small batches.
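+ *
+ *	A sketch of the expected per-batch calling pattern (assuming a caller
+ *	such as incremental sort): tuplesort_puttupleslot() for each tuple in
+ *	the batch, tuplesort_performsort(), tuplesort_gettupleslot() until the
+ *	batch is drained, then tuplesort_reset(); call tuplesort_end() once
+ *	after the last batch.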
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
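+	/*
+	 * The memtuples array may have been shrunk below its initial size (e.g.
+	 * by mergeruns); if so, restore it so the next batch starts with
+	 * adequate capacity.  (Our reading of the intent here.)
+	 */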
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2707,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2757,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3255,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6070a42b6f..bf379f7f20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1807,6 +1807,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
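+ *	 For example (an illustrative case), for ORDER BY a, b over input
+ *	 already ordered by a, the single presorted key is a.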
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1835,6 +1849,45 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						groupsCount;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	PresortedKeyData *presortedKeys;	/* keys the dataset is presorted by */
+	int64		groupsCount;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *grpPivotSlot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 443de22704..4bc270ed6f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -126,6 +127,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -241,6 +243,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index c922216b7d..974112d086 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -753,6 +753,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index abbbda9e91..86db5098e4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1524,6 +1524,16 @@ typedef struct SortPath
 } SortPath;
 
 /*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
+/*
  * GroupPath represents grouping (of presorted input)
  *
  * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..60edbd996f 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -106,8 +107,9 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..26787a6221 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f56151fc1e..f643422d5b 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1517,6 +1517,7 @@ NOTICE:  drop cascades to table matest1
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
@@ -1657,9 +1658,45 @@ FROM generate_series(1, 3) g(i);
  {3,7,8,10,13,13,16,18,19,22}
 (3 rows)
 
+set enable_incrementalsort = on;
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Merge Append
+   Sort Key: tenk1.thousand, tenk1.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+   ->  Incremental Sort
+         Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
+         ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
+(7 rows)
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Merge Append
+   Sort Key: a.thousand, a.tenthous
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
+   ->  Incremental Sort
+         Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
+         ->  Index Only Scan using tenk1_unique2 on tenk1 b
+(7 rows)
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 --
 -- Check handling of a constant-null CHECK constraint
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 84c6e9b5a4..78728f873a 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2347,18 +2347,21 @@ select count(*) from
   left join
   (select * from tenk1 y order by y.unique2) y
   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
+                                                  QUERY PLAN                                                  
+--------------------------------------------------------------------------------------------------------------
  Aggregate
    ->  Merge Left Join
-         Merge Cond: (x.thousand = y.unique2)
-         Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
+         Merge Cond: ((x.thousand = y.unique2) AND (x.twothousand = y.hundred) AND (x.fivethous = y.unique2))
          ->  Sort
                Sort Key: x.thousand, x.twothousand, x.fivethous
                ->  Seq Scan on tenk1 x
          ->  Materialize
-               ->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+               ->  Incremental Sort
+                     Sort Key: y.unique2, y.hundred
+                     Presorted Key: y.unique2
+                     ->  Subquery Scan on y
+                           ->  Index Scan using tenk1_unique2 on tenk1 y_1
+(12 rows)
 
 select count(*) from
   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..39c17c6f03 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -176,9 +176,11 @@ EXPLAIN (COSTS OFF)
 SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
                               QUERY PLAN                               
 -----------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.c, (sum(pagg_tab_p1.a)), (avg(pagg_tab_p1.b))
-   ->  Append
+   Presorted Key: pagg_tab_p1.c
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.c
          ->  GroupAggregate
                Group Key: pagg_tab_p1.c
                Filter: (avg(pagg_tab_p1.d) < '15'::numeric)
@@ -197,7 +199,7 @@ SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 O
                ->  Sort
                      Sort Key: pagg_tab_p3.c
                      ->  Seq Scan on pagg_tab_p3
-(21 rows)
+(23 rows)
 
 SELECT c, sum(a), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
   c   | sum  |         avg         | count 
@@ -215,8 +217,9 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
                               QUERY PLAN                               
 -----------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b)), (avg(pagg_tab_p1.b))
+   Presorted Key: pagg_tab_p1.a
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_p1.a
          Filter: (avg(pagg_tab_p1.d) < '15'::numeric)
@@ -237,7 +240,7 @@ SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 O
                      ->  Sort
                            Sort Key: pagg_tab_p3.a
                            ->  Seq Scan on pagg_tab_p3
-(22 rows)
+(23 rows)
 
 SELECT a, sum(b), avg(b), count(*) FROM pagg_tab GROUP BY 1 HAVING avg(d) < 15 ORDER BY 1, 2, 3;
  a  | sum  |         avg         | count 
@@ -356,9 +359,11 @@ EXPLAIN (COSTS OFF)
 SELECT c, sum(b order by a) FROM pagg_tab GROUP BY c ORDER BY 1, 2;
                                QUERY PLAN                               
 ------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.c, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
-   ->  Append
+   Presorted Key: pagg_tab_p1.c
+   ->  Merge Append
+         Sort Key: pagg_tab_p1.c
          ->  GroupAggregate
                Group Key: pagg_tab_p1.c
                ->  Sort
@@ -374,7 +379,7 @@ SELECT c, sum(b order by a) FROM pagg_tab GROUP BY c ORDER BY 1, 2;
                ->  Sort
                      Sort Key: pagg_tab_p3.c
                      ->  Seq Scan on pagg_tab_p3
-(18 rows)
+(20 rows)
 
 -- Since GROUP BY clause does not match with PARTITION KEY; we need to do
 -- partial aggregation. However, ORDERED SET are not partial safe and thus
@@ -383,8 +388,9 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
                                QUERY PLAN                               
 ------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_p1.a, (sum(pagg_tab_p1.b ORDER BY pagg_tab_p1.a))
+   Presorted Key: pagg_tab_p1.a
    ->  GroupAggregate
          Group Key: pagg_tab_p1.a
          ->  Sort
@@ -393,7 +399,7 @@ SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
                      ->  Seq Scan on pagg_tab_p1
                      ->  Seq Scan on pagg_tab_p2
                      ->  Seq Scan on pagg_tab_p3
-(10 rows)
+(11 rows)
 
 -- JOIN query
 CREATE TABLE pagg_tab1(x int, y int) PARTITION BY RANGE(x);
@@ -487,8 +493,9 @@ EXPLAIN (COSTS OFF)
 SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2.y GROUP BY t1.y HAVING avg(t1.x) > 10 ORDER BY 1, 2, 3;
                                QUERY PLAN                                
 -------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: t1.y, (sum(t1.x)), (count(*))
+   Presorted Key: t1.y
    ->  Finalize GroupAggregate
          Group Key: t1.y
          Filter: (avg(t1.x) > '10'::numeric)
@@ -521,7 +528,7 @@ SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2
                                  ->  Seq Scan on pagg_tab2_p3 t2_2
                                  ->  Hash
                                        ->  Seq Scan on pagg_tab1_p3 t1_2
-(34 rows)
+(35 rows)
 
 SELECT t1.y, sum(t1.x), count(*) FROM pagg_tab1 t1, pagg_tab2 t2 WHERE t1.x = t2.y GROUP BY t1.y HAVING avg(t1.x) > 10 ORDER BY 1, 2, 3;
  y  | sum  | count 
@@ -1068,8 +1075,9 @@ EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                             QUERY PLAN                             
 -------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
+   Presorted Key: pagg_tab_ml_p1.b
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_ml_p1.b
          ->  Sort
@@ -1090,7 +1098,7 @@ SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                      ->  Partial HashAggregate
                            Group Key: pagg_tab_ml_p3_s2.b
                            ->  Seq Scan on pagg_tab_ml_p3_s2
-(22 rows)
+(23 rows)
 
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
  b |  sum  | count 
@@ -1159,9 +1167,11 @@ EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
                                     QUERY PLAN                                    
 ----------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-   ->  Append
+   Presorted Key: pagg_tab_ml_p1.a
+   ->  Merge Append
+         Sort Key: pagg_tab_ml_p1.a
          ->  Finalize GroupAggregate
                Group Key: pagg_tab_ml_p1.a
                Filter: (avg(pagg_tab_ml_p1.b) < '3'::numeric)
@@ -1200,7 +1210,7 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER B
                                  ->  Partial HashAggregate
                                        Group Key: pagg_tab_ml_p3_s2.a
                                        ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(41 rows)
+(43 rows)
 
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
  a  | sum  | count 
@@ -1222,8 +1232,9 @@ EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                                  QUERY PLAN                                 
 ----------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
+   Presorted Key: pagg_tab_ml_p1.b
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_ml_p1.b
          ->  Gather Merge
@@ -1246,7 +1257,7 @@ SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_ml_p3_s2.b
                                  ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(24 rows)
+(25 rows)
 
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
  b |  sum  | count 
@@ -1327,8 +1338,9 @@ EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
                                       QUERY PLAN                                      
 --------------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
+   Presorted Key: pagg_tab_para_p1.x
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_para_p1.x
          Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
@@ -1346,7 +1358,7 @@ SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) <
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_para_p3.x
                                  ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
+(20 rows)
 
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
  x  | sum  |        avg         | count 
@@ -1364,8 +1376,9 @@ EXPLAIN (COSTS OFF)
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
                                       QUERY PLAN                                      
 --------------------------------------------------------------------------------------
- Sort
+ Incremental Sort
    Sort Key: pagg_tab_para_p1.y, (sum(pagg_tab_para_p1.x)), (avg(pagg_tab_para_p1.x))
+   Presorted Key: pagg_tab_para_p1.y
    ->  Finalize GroupAggregate
          Group Key: pagg_tab_para_p1.y
          Filter: (avg(pagg_tab_para_p1.x) < '12'::numeric)
@@ -1383,7 +1396,7 @@ SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) <
                            ->  Partial HashAggregate
                                  Group Key: pagg_tab_para_p3.y
                                  ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
+(20 rows)
 
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
  y  |  sum  |         avg         | count 
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 9397f72c13..cde4c2ee5a 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -546,6 +546,7 @@ drop table matest0 cascade;
 set enable_seqscan = off;
 set enable_indexscan = on;
 set enable_bitmapscan = off;
+set enable_incrementalsort = off;
 
 -- Check handling of duplicated, constant, or volatile targetlist items
 explain (costs off)
@@ -607,9 +608,26 @@ SELECT
     ORDER BY f.i LIMIT 10)
 FROM generate_series(1, 3) g(i);
 
+set enable_incrementalsort = on;
+
+-- check incremental sort is used when enabled
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+UNION ALL
+SELECT thousand, thousand FROM tenk1
+ORDER BY thousand, tenthous;
+
+explain (costs off)
+SELECT x, y FROM
+  (SELECT thousand AS x, tenthous AS y FROM tenk1 a
+   UNION ALL
+   SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
+ORDER BY x, y;
+
 reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
+reset enable_incrementalsort;
 
 --
 -- Check handling of a constant-null CHECK constraint
#74Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Kuzmenkov (#73)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi,

I've been doing a bit more review of the patch today, focusing on the
planner part, and I'm starting to have some doubts regarding the way
incremental sort paths are created. I have some questions about the
executor and other parts too.

I'll mark this as 'waiting on author' to make it clear the patch is
still being discussed; RFC is not an appropriate status for that.

Attached is a patch that highlights some of the interesting places, and
also suggests some minor changes to comments and other tweaks.

1) planning/costing of incremental vs. non-incremental sorts
------------------------------------------------------------

In short, all the various places that create/cost sorts:

* createplan.c (make_sort)
* planner.c (create_sort_path)
* pathnode.c (cost_sort)

seem to prefer incremental sorts whenever available. Consider for
example this code from create_merge_append_plan():

    if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
                                      &n_common_pathkeys))
    {
        Sort   *sort = make_sort(subplan, numsortkeys,
                                 n_common_pathkeys,
                                 sortColIdx, sortOperators,
                                 collations, nullsFirst);

        label_sort_with_costsize(root, sort, best_path->limit_tuples);
        subplan = (Plan *) sort;
    }

This essentially says that when (n_common_pathkeys > 0), the sort is
going to be incremental.

That however seems to rely on an important assumption - when the input
is presorted, the incremental sort is expected to be cheaper than
regular sort.

This assumption however seems to be proven invalid by cost_sort, which
does the common part for both sort modes (incremental/non-incremental)
first, and then does this:

    /* Extra costs of incremental sort */
    if (presorted_keys > 0)
    {
        ... add something to the costs ...
    }

That is, the incremental sort seems pretty much guaranteed to be costed
as more expensive than a regular Sort (with the exception of LIMIT
queries, where it's guaranteed to win thanks to its lower startup cost).

I don't know how significant the cost difference may be (perhaps not
much), or whether it may lead to inefficient plans. For example, what if
the cheapest total path happens to be partially sorted by chance, but
contains only a single prefix group? Then all the comparisons with
pivotSlot are unnecessary.

But I'm pretty sure it may lead to surprising behavior - for example if
you disable incremental sorts (enable_incrementalsort=off), the plan
will switch to plain sort without the additional costs. So you'll get a
cheaper plan by disabling some operation. That's surprising.

So I think it would be more appropriate if those places actually did a
costing of incremental vs. non-incremental sorts, and then constructed
the cheaper option. Essentially we should consider both plain and
incremental sort for each partially sorted input path, and then pick the
right one.
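
To make that concrete, something like this could work (just a sketch -
create_incremental_sort_path() and its signature are made up here, the
patch may end up structuring this differently):

    /*
     * Sketch only: build both sort variants on top of the same partially
     * sorted input path, and keep the cheaper one.  Alternatively, both
     * paths could simply be added and left for add_path() to prune.
     */
    SortPath   *plain = create_sort_path(root, rel, subpath,
                                         pathkeys, limit_tuples);
    SortPath   *incr = create_incremental_sort_path(root, rel, subpath,
                                                    pathkeys,
                                                    presorted_keys,
                                                    limit_tuples);

    if (incr->path.total_cost < plain->path.total_cost)
        add_path(rel, (Path *) incr);
    else
        add_path(rel, (Path *) plain);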

Of course, this is going to be tricky in createplan.c which builds the
plans directly - in that case it might be integrated into make_sort() or
something like that.

Also, I wonder if we could run into problems due to incremental sort not
supporting things the regular sort does (rewind, backward scans and
mark/restore).

2) nodeIncrementalsort.c
------------------------

There are a couple of obsolete comments that came from nodeSort.c and
did not get tweaked (and so talk about the first time through, when
incremental sort needs to do that for each group, etc.). The attached
diff tweaks those, and clarifies a couple of others. I've also added
some comments explaining what the pivotSlot is about, etc. There are
also a couple of XXX comments raising additional questions and asking
for clarifications.

I'm wondering if a static MIN_GROUP_SIZE is a good idea. For example,
what if the subplan is expected to return only very few tuples (say, 33),
but the query includes LIMIT 1? Now, let's assume the startup/total cost
of the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use Min(limit, MIN_GROUP_SIZE) here, and perhaps
this should depend on the average group size too.
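
As a sketch (minGroupSize is a made-up variable; bound and bound_Done
are the fields the executor already tracks for bounded sorts):

    /*
     * Sketch: cap the minimal group size by the remaining bound, so that
     * a small LIMIT doesn't force fetching MIN_GROUP_SIZE tuples from an
     * expensive subplan.
     */
    int64       minGroupSize = MIN_GROUP_SIZE;

    if (node->bounded)
        minGroupSize = Min(node->bound - node->bound_Done, minGroupSize);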

The other questionable thing seems to be this claim:

* We unlikely will have huge groups with incremental sort. Therefore
* usage of abbreviated keys would be likely a waste of time.

followed by disabling abbreviated keys in the tuplesort_begin_heap call.
I find this rather dubious and unsupported by any arguments (I certainly
don't see any in the comments).

It would be more acceptable if the estimated number of groups was used
when deciding whether to use incremental sort or not, but that's not the
case - as explained in the first part, we simply prefer incremental
sorts whenever there is a presorted prefix. In those cases we have very
little idea (or even guarantee) regarding the average group size.

Furthermore, cost_sort is estimating the number of groups, so it does
know the average group size. I don't see why we couldn't consider it
here too, and disable/enable abbreviated keys depending on that.
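
A sketch of what I mean (the estimate variable and the cutoff value are
made up for illustration):

    /*
     * Sketch: use the planner's estimated average group size to decide
     * whether abbreviated keys are worth the setup overhead, instead of
     * disabling them unconditionally.
     */
    #define ABBREV_MIN_GROUP_SIZE   1000    /* made-up cutoff */

    bool        useAbbrev = (estAvgGroupSize >= ABBREV_MIN_GROUP_SIZE);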

3) pathkeys.c
-------------

The new function pathkeys_useful_for_ordering() actually changes
behavior depending on enable_incrementalsort. That seems like a rather
bad idea, for a couple of reasons.

AFAICS pathkeys.c is supposed to provide generic utilities for working
with pathkeys, and no one would expect these functions to change
behavior depending on enable_* GUCs. I certainly would not.

In short, this does not seem like the right place to enable/disable
incremental sorts; that should be done when costing the plan (i.e. in
costsize.c) or creating the plan.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-comments.patchtext/x-patch; name=0001-comments.patchDownload
From b4c2a801aa802e71b8a822f5e1cd163463e97e26 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sat, 31 Mar 2018 19:49:38 +0200
Subject: [PATCH] comments

---
 src/backend/commands/explain.c             |   4 +
 src/backend/executor/nodeIncrementalSort.c | 118 ++++++++++++++++++++---------
 src/backend/optimizer/path/costsize.c      |  54 ++++++++-----
 src/backend/optimizer/path/pathkeys.c      |   7 +-
 src/backend/optimizer/plan/createplan.c    |  14 ++++
 src/backend/optimizer/plan/planagg.c       |   1 -
 src/backend/optimizer/plan/planner.c       |   6 +-
 src/backend/optimizer/util/pathnode.c      |   5 ++
 8 files changed, 155 insertions(+), 54 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index edd71ae..8533f36 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2016,6 +2016,10 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 	int			presortedCols;
 
+	/*
+	 * XXX This seems unnecessary. In which case do we call show_sort_keys
+	 * for incremental sort? Even the comment says it's for Sort only.
+	 */
 	if (IsA(plan, IncrementalSort))
 		presortedCols = ((IncrementalSort *) plan)->presortedCols;
 	else
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 1f5e41f..b95379b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -5,45 +5,49 @@
  *
  * DESCRIPTION
  *
- *		Incremental sort is a specially optimized kind of multikey sort used
- *		when the input is already presorted by a prefix of the required keys
- *		list.  Thus, when it's required to sort by (key1, key2 ... keyN) and
- *		result is already sorted by (key1, key2 ... keyM), M < N, we sort groups
- *		where values of (key1, key2 ... keyM) are equal.
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *      when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *      divide the input into groups where keys (key1, ... keyM) are equal,
+ *      and only sort on the remaining columns.
  *
- *		Consider the following example.  We have input tuples consisting from
- *		two integers (x, y) already presorted by x, while it's required to
- *		sort them by x and y.  Let input tuples be following.
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
  *
  *		(1, 5)
  *		(1, 2)
- *		(2, 10)
+ *		(2, 9)
  *		(2, 1)
  *		(2, 5)
  *		(3, 3)
  *		(3, 7)
  *
- *		Incremental sort algorithm would sort by y following groups, which have
- *		equal x, individually:
+ *		Incremental sort algorithm would split the input into the following
+ *      groups, which have equal X, and then sort them by Y individually:
+ *
  *			(1, 5) (1, 2)
- *			(2, 10) (2, 1) (2, 5)
+ *			(2, 9) (2, 1) (2, 5)
  *			(3, 3) (3, 7)
  *
  *		After sorting these groups and putting them altogether, we would get
- *		following tuple set which is actually sorted by x and y.
+ *		the following result which is sorted by X and Y, as requested:
  *
  *		(1, 2)
  *		(1, 5)
  *		(2, 1)
  *		(2, 5)
- *		(2, 10)
+ *		(2, 9)
  *		(3, 3)
  *		(3, 7)
  *
- *		Incremental sort is faster than full sort on large datasets.  But
- *		the case of most huge benefit of incremental sort is queries with
- *		LIMIT because incremental sort can return first tuples without reading
- *		whole input dataset.
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *      on large datasets, as it reduces the amount of data to sort at once,
+ *      making it more likely it fits into work_mem (eliminating the need to
+ *      spill to disk).  But the main advantage of incremental sort is that
+ *      it can start producing rows early, before sorting the whole dataset,
+ *      which is a significant benefit especially for queries with LIMIT.
  *
  * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -112,18 +116,27 @@ preparePresortedCols(IncrementalSortState *node)
 
 /*
  * Check if first "presortedCols" sort values are equal.
+ *
+ * XXX I find the name somewhat confusing, because "cmp" is usually used
+ * for comparators, i.e. functions that return -1/0/1. I suggest a name
+ * like "samePrefixGroup" instead.
  */
 static bool
 cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
 															TupleTableSlot *b)
 {
-	int n, i;
+	int presortedCols, i;
 
 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
 
-	n = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
 
-	for (i = n - 1; i >= 0; i--)
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
 	{
 		Datum				datumA,
 							datumB,
@@ -171,6 +184,12 @@ cmpSortPresortedCols(IncrementalSortState *node, TupleTableSlot *a,
  * to cope this problem we don't copy pivot tuple before the group contains
  * at least MIN_GROUP_SIZE of tuples.  Surely, it might reduce efficiency of
  * incremental sort, but it reduces the probability of regression.
+ *
+ * XXX I suppose this is not just about copying the tuples into the slot, but
+ * about frequently sorting tiny amounts of data (tuplesort overhead)?
+ *
+ * XXX Fixed-size limit seems like a bad idea, for example when the subplan
+ * is expected to produce only very few tuples (less than 32) at high cost.
  */
 #define MIN_GROUP_SIZE 32
 
@@ -205,6 +224,9 @@ ExecIncrementalSort(PlanState *pstate)
 	TupleDesc			tupDesc;
 	int64				nTuples = 0;
 
+	/* XXX ExecSort does this, I don't see why ExecIncrementalSort shouldn't */
+	CHECK_FOR_INTERRUPTS();
+
 	/*
 	 * get state info from node
 	 */
@@ -216,7 +238,9 @@ ExecIncrementalSort(PlanState *pstate)
 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
 
 	/*
-	 * Return next tuple from sorted set if any.
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
 	 */
 	if (node->sort_Done)
 	{
@@ -228,8 +252,10 @@ ExecIncrementalSort(PlanState *pstate)
 	}
 
 	/*
-	 * If first time through, read all tuples from outer plan and pass them to
-	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
 	 */
 
 	SO1_printf("ExecIncrementalSort: %s\n",
@@ -241,15 +267,12 @@ ExecIncrementalSort(PlanState *pstate)
 	 */
 	estate->es_direction = ForwardScanDirection;
 
-	/*
-	 * Initialize tuplesort module.
-	 */
-	SO1_printf("ExecIncrementalSort: %s\n",
-			   "calling tuplesort_begin");
-
 	outerNode = outerPlanState(node);
 	tupDesc = ExecGetResultType(outerNode);
 
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
 	if (node->tuplesortstate == NULL)
 	{
 		/*
@@ -259,12 +282,17 @@ ExecIncrementalSort(PlanState *pstate)
 		 */
 		preparePresortedCols(node);
 
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
 		/*
 		 * Pass all the columns to tuplesort.  We pass to tuple sort groups
 		 * of at least MIN_GROUP_SIZE size.  Thus, these groups doesn't
 		 * necessary have equal value of the first column.  We unlikely will
 		 * have huge groups with incremental sort.  Therefore usage of
 		 * abbreviated keys would be likely a waste of time.
+		 *
+		 * XXX The claim about abbreviated keys seems rather dubious, IMHO.
 		 */
 		tuplesortstate = tuplesort_begin_heap(
 									tupDesc,
@@ -290,7 +318,7 @@ ExecIncrementalSort(PlanState *pstate)
 	if (node->bounded)
 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
 
-	/* Put saved tuple to tuplesort if any */
+	/* If we got a left-over tuple from the last group, pass it to tuplesort. */
 	if (!TupIsNull(node->grpPivotSlot))
 	{
 		tuplesort_puttupleslot(tuplesortstate, node->grpPivotSlot);
@@ -312,19 +340,41 @@ ExecIncrementalSort(PlanState *pstate)
 			break;
 		}
 
-		/* Put next group of presorted data to the tuplesort */
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least MIN_GROUP_SIZE tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
 		if (nTuples < MIN_GROUP_SIZE)
 		{
 			tuplesort_puttupleslot(tuplesortstate, slot);
 
-			/* Save last tuple in minimal group */
+			/* Keep the last tuple in minimal group as a pivot. */
 			if (nTuples == MIN_GROUP_SIZE - 1)
 				ExecCopySlot(node->grpPivotSlot, slot);
 			nTuples++;
 		}
 		else
 		{
-			/* Iterate while presorted cols are the same as in saved tuple */
+			/*
+			 * Iterate while presorted cols are the same as in saved tuple
+			 *
+			 * After accumulating at least MIN_GROUP_SIZE tuples (we don't
+			 * know how many groups are there in that set), we need to keep
+			 * know how many groups there are in that set), we need to keep
+			 * we can do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group. We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (it does not serve as pivot, though).
+			 */
 			if (cmpSortPresortedCols(node, node->grpPivotSlot, slot))
 			{
 				tuplesort_puttupleslot(tuplesortstate, slot);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e8cfdd8..52956a8 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1616,13 +1616,6 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  *	  Determines and returns the cost of sorting a relation, including
  *	  the cost of reading the input data.
  *
- * Sort could be either full sort of relation or incremental sort when we already
- * have data presorted by some of required pathkeys.  In the second case
- * we estimate number of groups which source data is divided to by presorted
- * pathkeys.  And then estimate cost of sorting each individual group assuming
- * data is divided into group uniformly.  Also, if LIMIT is specified then
- * we have to pull from source and sort only some of total groups.
- *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
  * comparisons for t tuples.
@@ -1648,6 +1641,16 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
+ * The sort may also be incremental, when the input data is already sorted
+ * by a prefix of the requested pathkeys.  In that case we estimate the
+ * number of groups the input data is divided into (by the prefix keys), and
+ * then apply the same costing criteria as for regular sort.  For example the
+ * sort_mem limit is applied on per-group size (assuming average group size),
+ * not the total volume of data.
+ *
+ * If LIMIT is specified, incremental sort only needs to pull and sort
+ * a subset of the input data, unlike the regular sort.
+ *
  * 'pathkeys' is a list of sort keys
  * 'presorted_keys' is a number of pathkeys already presorted in given path
  * 'input_startup_cost' is the startup cost for reading the input data
@@ -1687,6 +1690,7 @@ cost_sort(Path *path, PlannerInfo *root,
 
 	if (!enable_sort)
 		startup_cost += disable_cost;
+
 	if (!enable_incrementalsort)
 		presorted_keys = 0;
 
@@ -1749,6 +1753,19 @@ cost_sort(Path *path, PlannerInfo *root,
 		 */
 		group_input_bytes = 1.5 * input_bytes / num_groups;
 		group_tuples = 1.5 * tuples / num_groups;
+
+		/*
+		 * We want to be sure the cost of a sort is never estimated as zero, even
+		 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+		 *
+		 * XXX Same protection as for tuples, at the beginning. And we don't need
+		 * to worry about LOG2() below.
+		 *
+		 * XXX Probably should re-evaluate group_input_bytes, but the difference
+		 * is going to be tiny.
+		 */
+		if (group_tuples < 2.0)
+			group_tuples = 2.0;
 	}
 	else
 	{
@@ -1757,6 +1774,10 @@ cost_sort(Path *path, PlannerInfo *root,
 		group_tuples = tuples;
 	}
 
+	/*
+	 * XXX Can it actually happen that the first condition is true,
+	 * while the second one is false? I don't think so.
+	 */
 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
@@ -1799,15 +1820,8 @@ cost_sort(Path *path, PlannerInfo *root,
 	}
 	else
 	{
-		/*
-		 * We'll use plain quicksort on all the input tuples.  If it appears
-		 * that we expect less than two tuples per sort group then assume
-		 * logarithmic part of estimate to be 1.
-		 */
-		if (group_tuples >= 2.0)
-			group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
-		else
-			group_cost = comparison_cost * group_tuples;
+		/* We'll use plain quicksort on all the input tuples. */
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 	}
 
 	/* Add per group cost of fetching tuples from input */
@@ -1832,7 +1846,13 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	run_cost += cpu_operator_cost * tuples;
 
-	/* Extra costs of incremental sort */
+	/*
+	 * Extra costs of incremental sort
+	 *
+	 * XXX This pretty much implies there are cases where incremental sort
+	 * is costed as more expensive than plain sort. The difference may be
+	 * fairly small, but if the groups are tiny it may be noticeable.
+	 */
 	if (presorted_keys > 0)
 	{
 		/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 57fe52d..506d898 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -27,7 +27,6 @@
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
-#include "utils/selfuncs.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -1648,6 +1647,12 @@ pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 	}
 	else
 	{
+		/*
+		 * XXX This seems really strange. Why should we even consider the GUC
+		 * here? It's supposed to be a generic utility function. That belongs
+		 * to costsize.c only, I think. No other function here does something
+		 * like that, which is illustrated by having to include cost.h.
+		 */
 		if (enable_incrementalsort)
 		{
 			/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9ca06c7..74230ec 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1180,6 +1180,13 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
+
+		/*
+		 * XXX This seems wrong, as it builds an incremental sort whenever
+		 * there's at least one prefix column, even when a regular sort
+		 * would be cheaper.
+		 */
+
 		if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
 										  &n_common_pathkeys))
 		{
@@ -1564,6 +1571,13 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
+
+	/*
+	 * XXX This seems wrong, as it builds an incremental sort whenever
+	 * there's at least one prefix column, even when a regular sort
+	 * would be cheaper.
+	 */
+
 	if (!pathkeys_common_contained_in(pathkeys, best_path->subpath->pathkeys,
 									  &n_common_pathkeys))
 	{
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 308f60b..95cbffb 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -44,7 +44,6 @@
 #include "parser/parse_clause.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
-#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 39dabce..fa312e9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6209,7 +6209,11 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 									root->group_pathkeys, path->pathkeys);
 			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
-				/* Sort the cheapest-total path if it isn't already sorted */
+				/* Sort the cheapest-total path if it isn't already sorted
+				 *
+				 * XXX This comment is obviously stale, as it's not about
+				 * cheapest-total path, but about paths sorted by prefix.
+				 */
 				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2202e97..62f39c0 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1368,6 +1368,11 @@ create_merge_append_path(PlannerInfo *root,
 		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
 			subpath->parallel_safe;
 
+
+		/*
+		 * XXX Same issue as in createplan.c - we always create incremental
+		 * sort, even if plain sort would be cheaper.
+		 */
 		if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
 										 &n_common_pathkeys))
 		{
-- 
2.9.5

#75Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#74)
Re: [HACKERS] [PATCH] Incremental sort

On 03/31/2018 10:43 PM, Tomas Vondra wrote:

...
But I'm pretty sure it may lead to surprising behavior - for example if
you disable incremental sorts (enable_incrementalsort=off), the plan
will switch to plain sort without the additional costs. So you'll get a
cheaper plan by disabling some operation. That's surprising.

To illustrate that this is a valid issue, consider this trivial example:

create table t (a int, b int, c int);

insert into t select 10*random(), 10*random(), 10*random()
from generate_series(1,1000000) s(i);

analyze t;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=133100.48..264139.27 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

set enable_incrementalsort = off;

explain select * from (select * from t order by a,b) foo order by a,b,c;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort  (cost=261402.69..263902.69 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(5 rows)

So the cost with incremental sort was 264139, and after disabling
incremental sort it dropped to 263902. Granted, the difference is
negligible in this case, but it's still surprising.

Also, it can be made much more significant by reducing the number of
prefix groups in the data:

truncate t;

insert into t select 1,1,1 from generate_series(1,1000000) s(i);

analyze t;

set enable_incrementalsort = on;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=324165.83..341665.85 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

So that's 263902 vs. 341665, yet we still prefer the incremental mode.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#76Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#75)
Re: [HACKERS] [PATCH] Incremental sort

On Sun, Apr 1, 2018 at 12:06 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 03/31/2018 10:43 PM, Tomas Vondra wrote:

...
So that's 263902 vs. 341665, yet we still prefer the incremental mode.

The problem is well-defined, thank you.
I'll check what can be done in this area today.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#77Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#76)
Re: [HACKERS] [PATCH] Incremental sort

On 04/03/2018 11:09 AM, Alexander Korotkov wrote:

...
So that's 263902 vs. 341665, yet we still prefer the incremental mode.

The problem is well-defined, thank you.
I'll check what can be done in this area today.

I think solving this may be fairly straightforward. Essentially, until
now we only had one way to do the sort, so it was OK to make the sort
implicit by checking if the path is sorted:

    if (input not sorted)
    {
        ... add a Sort node ...
    }

But now we have multiple possible ways to do the sort, with different
startup/total costs. So the places that create the sorts need to
actually generate the Sort paths for each sort alternative, and store
the information in the Sort node (instead of relying on pathkeys).

Ultimately, this should simplify the createplan.c places, making all the
make_sort calls unnecessary (i.e. the input should already be sorted
when needed). Otherwise it'd mean the decision needs to be made locally,
but I don't think that should be needed.

But it's surely a fairly invasive change to the patch ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#78Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#77)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi!

On Tue, Apr 3, 2018 at 2:10 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

...

But now we have multiple possible ways to do the sort, with different
startup/total costs. So the places that create the sorts need to
actually generate the Sort paths for each sort alternative, and store
the information in the Sort node (instead of relying on pathkeys).

But it's surely a fairly invasive change to the patch ...

Right, there are situations when incremental sort has a lower startup
cost but a higher total cost. To find the cheaper one, we would ideally
generate paths for both full sort and incremental sort. However, that
would increase the total number of paths and could slow down planning.
Another issue is that we don't always generate paths for sorts. And yes,
it would be rather invasive. So that doesn't look feasible to get into 11.

Instead, I decided to cut back the usage of incremental sort. Now
incremental sort is generated only in create_sort_path(), where the
cheaper path is selected between incremental sort and full sort, taking
limit_tuples into account. That limits the usage of incremental sort,
but the risk of regression from this patch is also minimal. In fact,
incremental sort will be used only when a sort is explicitly specified,
and at the same time either LIMIT is specified or the dataset to be
sorted is large enough that incremental sort saves disk I/O.
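
For reference, taking limit_tuples into account essentially means
comparing fractional costs, roughly like this (a sketch; the actual
comparison is done by compare_fractional_path_costs() in pathnode.c):

    /*
     * Sketch: when only limit_tuples of the output are needed, cost each
     * path for producing just that fraction of its output, and compare
     * those fractional costs instead of the totals.
     */
    double      fraction = limit_tuples / path->rows;
    Cost        cost = path->startup_cost +
        fraction * (path->total_cost - path->startup_cost);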

The attached patch also incorporates the following commits made by
Alexander Kuzmenkov:
* Rename fields of IncrementalSortState to snake_case for the sake of
consistency.
* Rename the group test function to isCurrentGroup.
* Address comments from Tomas Vondra about nodeIncrementalSort.c.
* Add a test for incremental sort.
* Add a separate function to calculate costs of incremental sort.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-22.patchapplication/octet-stream; name=incremental-sort-22.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index fa0d1db5fb..2c0c6c3768 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query the essential optimization is
+-- a top-N sort, but that can't happen on the remote side because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down either, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- plus incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index cf32be4bfe..96c9eb7ea6 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query the essential optimization is
+-- a top-N sort, but that can't happen on the remote side because we never
+-- push LIMIT down.  Assuming the sort is not worth pushing down either, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side is able
+-- to return tuples in the given order without a full sort, using an index scan
+-- plus incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,12 +2019,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2441,6 +2476,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..aaf8bb5177 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike a full sort, incremental sort keeps only the current
+			 * batch in memory, so it cannot support backward scans.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..5f28a3a5ea
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,673 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y), already presorted by X, while we need to sort
+ *		them by both X and Y.  Suppose the input tuples are the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them back together, we would
+ *		get the following result, which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We assume the input is sorted by keys (0, ... n), which means the
+	 * tail keys are more likely to change between adjacent tuples.  So we
+	 * compare from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient.  To cope with
+ * this, we don't start looking for the next group boundary until the
+ * current group contains at least MIN_GROUP_SIZE tuples.
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are
+ *		equal and sorts them using tuplesort.  This avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it can start returning tuples before fetching the full
+ *		dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the current sorted group, if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
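+		/*
+		 * If the input is already exhausted (node->finished), return the
+		 * slot even when the tuplesort has no more tuples; the empty slot
+		 * then signals end of data to the caller.
+		 */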
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures used by isCurrentGroup() to
+		 * compare the already-sorted prefix columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * MIN_GROUP_SIZE tuples to tuplesort, so these groups don't
+		 * necessarily have equal values in the first column.  We are
+		 * unlikely to see huge groups with incremental sort, so using
+		 * abbreviated keys would likely be a waste of time.
+		 *
+		 * XXX The claim about abbreviated keys seems rather dubious, IMHO.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/* For a bounded sort, calculate the bound remaining for this group */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Fetch the next group of tuples, whose presortedCols sort values are
+	 * all equal, and pass them to the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least MIN_GROUP_SIZE tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least MIN_GROUP_SIZE tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group.  Only then
+			 * can we do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group.  We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (where it is merely the first tuple of the next group, not a
+			 * pivot yet).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+	 * current sort group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize the return slot and type.  No need to initialize
+	 * projection info, because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop the standalone slot used for tuples from the outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort keeps only the current group in the tuplesort, so
+	 * we cannot simply rewind the sorted output.  Forget the previous
+	 * sort results; we have to re-read the subplan and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
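+	/* one IncrementalSortInfo slot per worker, plus the shared header */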
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c3efca3c45..718f806f0d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4869,6 +4903,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c8d962670e..e12855a094 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -893,12 +893,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -920,6 +918,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3781,6 +3797,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4518fa0cdb..9b9f4d11dc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2091,12 +2091,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2105,6 +2106,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2674,6 +2701,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c4e4db15a6..ae68595e1b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3667,6 +3667,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..91a76294e6 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,189 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ *	  Determines the cost of a full sort of a relation, including the
+ *	  cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   List *pathkeys, Cost input_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines the cost of sorting a relation incrementally, when the
+ *	  input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort, which is done by cost_tuplesort().
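+ *
+ * As a rough sketch of the arithmetic below: with 10000 input tuples
+ * estimated to fall into 100 groups, each tuplesort run covers about 100
+ * tuples (costed pessimistically as 150), the startup cost covers sorting
+ * the first group plus reading 1/100th of the input, and each later group
+ * adds its own sort cost to the run cost.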
+ */
+void
+cost_incremental_sort(Cost *startup_cost, Cost *run_cost,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost	input_run_cost = input_total_cost - input_startup_cost;
+
+	double	output_tuples,
+			output_groups,
+			group_tuples,
+			input_groups;
+
+	Cost	group_startup_cost,
+			group_run_cost;
+
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	*startup_cost = input_startup_cost;
+	*run_cost = 0;
+
+	if (!enable_incrementalsort)
+		*startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/*
+	 * Estimate the number of groups into which the dataset is divided
+	 * by the presorted keys.
+	 */
+	Assert(presorted_keys != 0);
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution
+	 * of tuples among the groups, and we're relying on quite rough
+	 * assumptions here.  Thus, we're pessimistic about its performance
+	 * and increase the assumed average group size by half.
+	 */
+	group_tuples = input_tuples / input_groups;
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/* Startup cost of incremental sort is the startup cost of its first group. */
+	*startup_cost += group_startup_cost;
+	*startup_cost += input_run_cost * (1.0 / input_groups);
+
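+	/*
+	 * Charge the per-group run cost for every output group, and the
+	 * per-group sort (startup) cost for all groups after the first one,
+	 * whose sort is already included in the startup cost above.
+	 */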
+	*run_cost += group_run_cost * output_groups
+			+ group_startup_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	*run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	*run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	/*
+	 * Account for input run cost. Unlike full sort, we don't have to read
+	 * the entire input if we have a limit clause.
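+	 * The startup cost already includes 1/input_groups of the input run
+	 * cost, so charge only the remaining fraction, scaled by the share of
+	 * input tuples we actually need to read.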
+	 */
+	*run_cost += input_run_cost * output_tuples / input_tuples *
+				 (1.0 - (1.0 / input_groups));
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * The sort can be either a full sort of the relation, or an incremental sort
+ * when the input path is already sorted by some leading pathkeys.  The number
+ * of such pathkeys is given by 'presorted_keys'.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  That is only acceptable when
+ * presorted_keys is 0, since an incremental sort needs the pathkeys to
+ * estimate the number of presorted groups.
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	if (presorted_keys > 0)
+		cost_incremental_sort(&startup_cost, &run_cost,
+							  root, pathkeys, presorted_keys,
+							  input_startup_cost, input_total_cost,
+							  tuples, width, comparison_cost, sort_mem,
+							  limit_tuples);
+	else
+		cost_full_sort(&startup_cost, &run_cost,
+					   pathkeys, input_total_cost,
+					   tuples, width, comparison_cost, sort_mem,
+					   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
@@ -1945,6 +2111,16 @@ cost_append(AppendPath *apath)
  *
  * As in cost_sort, we charge two operator evals per tuple comparison.
  *
+ * The sort may also be incremental, when the input data is already sorted
+ * by a prefix of the requested pathkeys.  In that case we estimate the
+ * number of groups the input data is divided into (by the prefix keys), and
+ * then apply the same costing criteria as for regular sort.  For example the
+ * sort_mem limit is applied on per-group size (assuming average group size),
+ * not the total volume of data.
+ *
+ * If LIMIT is specified, incremental sort only needs to pull and sort
+ * a subset of the input data, unlike the regular sort.
+ *
  * 'pathkeys' is a list of sort keys
  * 'n_streams' is the number of input streams
  * 'input_startup_cost' is the sum of the input streams' startup costs
@@ -2728,6 +2904,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  0,
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->pathtarget->width,
@@ -2754,6 +2932,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  0,
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->pathtarget->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..1d37685988 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,6 +22,7 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
@@ -327,6 +328,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
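+ *    For example, with keys1 = (a, b) and keys2 = (a, b, c) this returns
+ *    true and sets *n_common = 2, while with keys1 = (a, c) it returns
+ *    false and sets *n_common = 1.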
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1580,26 +1626,45 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading pathkeys that match the given
+ * query_pathkeys.  The remaining keys can be satisfied by an incremental
+ * sort.
  */
-static int
-pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
+int
+pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys)
 {
-	if (root->query_pathkeys == NIL)
+	int	n_common_pathkeys;
+
+	if (query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	if (pathkeys_common_contained_in(query_pathkeys, pathkeys, &n_common_pathkeys))
 	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		/* Full match of pathkeys: always useful */
+		return n_common_pathkeys;
+	}
+	else
+	{
+		if (enable_incrementalsort)
+		{
+			/*
+			 * Return the number of pathkeys in common, or 0 if there are
+			 * none.  Any leading common pathkeys are useful for ordering,
+			 * because we can use an incremental sort for the rest.
+			 */
+			return n_common_pathkeys;
+		}
+		else
+		{
+			/*
+			 * When incremental sort is disabled, pathkeys are useful only
+			 * when they contain all the query pathkeys.
+			 */
+			return 0;
+		}
 	}
-
-	return 0;					/* path ordering not useful */
 }
 
 /*
@@ -1615,7 +1680,7 @@ truncate_useless_pathkeys(PlannerInfo *root,
 	int			nuseful2;
 
 	nuseful = pathkeys_useful_for_merging(root, rel, pathkeys);
-	nuseful2 = pathkeys_useful_for_ordering(root, pathkeys);
+	nuseful2 = pathkeys_useful_for_ordering(root->query_pathkeys, pathkeys);
 	if (nuseful2 > nuseful)
 		nuseful = nuseful2;
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..8d39b5c2dc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -242,7 +242,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -258,7 +258,7 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -454,6 +454,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1183,7 +1184,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
 		{
-			Sort	   *sort = make_sort(subplan, numsortkeys,
+			Sort	   *sort = make_sort(subplan, numsortkeys, 0,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1563,11 +1564,14 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
+	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 0,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
+	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
@@ -1717,6 +1721,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1726,6 +1731,11 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
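+	/* An IncrementalSortPath records how many leading pathkeys are presorted */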
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
 	/*
 	 * make_sort_from_pathkeys() indirectly calls find_ec_member_for_tle(),
 	 * which will ignore any child EC members that don't belong to the given
@@ -1734,7 +1744,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	 */
 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
 								   IS_OTHER_REL(best_path->subpath->parent) ?
-								   best_path->path.parent->relids : NULL);
+								   best_path->path.parent->relids : NULL,
+								   n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -3932,7 +3943,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		Relids		outer_relids = outer_path->parent->relids;
 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
 												   best_path->outersortkeys,
-												   outer_relids);
+												   outer_relids,
+												   0);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3946,7 +3958,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		Relids		inner_relids = inner_path->parent->relids;
 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
 												   best_path->innersortkeys,
-												   inner_relids);
+												   inner_relids,
+												   0);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -5001,8 +5014,13 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
+	int			presorted_cols = 0;
 
-	cost_sort(&sort_path, root, NIL,
+	if (IsA(plan, IncrementalSort))
+		presorted_cols = ((IncrementalSort *) plan)->presortedCols;
+
+	cost_sort(&sort_path, root, NIL, presorted_cols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -5593,13 +5611,31 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	/* Always use regular sort node when enable_incrementalsort = false */
+	if (!enable_incrementalsort)
+		presortedCols = 0;
+
+	if (presortedCols == 0)
+	{
+		node = makeNode(Sort);
+	}
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5932,9 +5968,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5954,7 +5992,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5997,7 +6035,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6051,7 +6089,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6723,6 +6761,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 15c8d34c70..6a595c3190 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4855,13 +4855,13 @@ create_ordered_paths(PlannerInfo *root,
 	foreach(lc, input_rel->pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
-		bool		is_sorted;
+		int			n_useful_pathkeys;
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
+														 path->pathkeys);
+		if (path == cheapest_input_path || n_useful_pathkeys > 0)
 		{
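+			/*
+			 * Sort if the path isn't fully sorted yet; when a useful prefix
+			 * of the pathkeys is already sorted, the sort may be done
+			 * incrementally.
+			 */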
-			if (!is_sorted)
+			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -5994,8 +5994,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->reltarget->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
@@ -6233,14 +6234,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
-			bool		is_sorted;
+			int			n_useful_pathkeys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
-			if (path == cheapest_path || is_sorted)
+			n_useful_pathkeys = pathkeys_useful_for_ordering(
+									root->group_pathkeys, path->pathkeys);
+			if (path == cheapest_path || n_useful_pathkeys > 0)
 			{
-				/* Sort the cheapest-total path if it isn't already sorted */
-				if (!is_sorted)
+				/*
+				 * Sort the path if it isn't already sorted.  A sort may be
+				 * needed for the cheapest-total path, or for a path sorted
+				 * only by a prefix of the required pathkeys.
+				 */
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
@@ -6303,12 +6308,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				int			n_useful_pathkeys;
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
-				 * sorting anything but the cheapest path.
+				 * non-incrementally sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				n_useful_pathkeys = pathkeys_useful_for_ordering(
+										root->group_pathkeys, path->pathkeys);
+				if (n_useful_pathkeys == 0 &&
+					path != partially_grouped_rel->cheapest_total_path)
+					continue;
+				if (n_useful_pathkeys < list_length(root->group_pathkeys))
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5236ab378e..1b23a3f8c5 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1110,7 +1110,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_path->startup_cost;
 	sorted_p.total_cost = input_path->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_path->rows, input_path->pathtarget->width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..92005bab1d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1381,6 +1381,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  0,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->pathtarget->width,
@@ -1628,7 +1630,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  subpath->pathtarget->width,
@@ -1751,6 +1754,8 @@ create_gather_merge_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		cost_sort(&sort_path,
 				  root,
 				  pathkeys,
+				  0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  subpath->rows,
 				  subpath->pathtarget->width,
@@ -2610,9 +2615,98 @@ create_sort_path(PlannerInfo *root,
 				 List *pathkeys,
 				 double limit_tuples)
 {
-	SortPath   *pathnode = makeNode(SortPath);
+	SortPath   *pathnode;
+	int			n_common_pathkeys;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * Try incremental sort when it's enabled and there are common pathkeys;
+	 * use regular sort otherwise.
+	 */
+	if (enable_incrementalsort)
+		n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+	else
+		n_common_pathkeys = 0;
+
+	/*
+	 * Incremental sort might have higher total cost than full sort in some
+	 * cases.  So, we need to choose between incremental and full sort.  We
+	 * do this taking into account 'limit_tuples'.
+	 */
+	if (n_common_pathkeys > 0)
+	{
+		Cost	incremental_startup_cost,
+				incremental_run_cost,
+				full_startup_cost,
+				full_run_cost;
+		double	fraction;
+
+		cost_incremental_sort(&incremental_startup_cost,
+							  &incremental_run_cost,
+							  root, pathkeys, n_common_pathkeys,
+							  subpath->startup_cost,
+							  subpath->total_cost,
+							  subpath->rows,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem, limit_tuples);
+
+		cost_full_sort(&full_startup_cost,
+					   &full_run_cost,
+					   pathkeys,
+					   subpath->total_cost,
+					   subpath->rows,
+					   subpath->pathtarget->width,
+					   0.0,
+					   work_mem, limit_tuples);
+
+		fraction = limit_tuples / subpath->rows;
+		if (fraction <= 0.0 || fraction >= 1.0)
+			fraction = 1.0;
+
+		if (incremental_startup_cost + incremental_run_cost * fraction >=
+			full_startup_cost + full_run_cost * fraction)
+		{
+			startup_cost = full_startup_cost;
+			run_cost = full_run_cost;
+			n_common_pathkeys = 0;
+		}
+		else
+		{
+			startup_cost = incremental_startup_cost;
+			run_cost = incremental_run_cost;
+		}
+	}
+	else
+	{
+		cost_full_sort(&startup_cost,
+					   &run_cost,
+					   pathkeys,
+					   subpath->total_cost,
+					   subpath->rows,
+					   subpath->pathtarget->width,
+					   0.0,			/* XXX comparison_cost shouldn't be 0? */
+					   work_mem, limit_tuples);
+	}
+
+	if (n_common_pathkeys == 0)
+	{
+		pathnode = makeNode(SortPath);
+		pathnode->path.pathtype = T_Sort;
+	}
+	else
+	{
+		IncrementalSortPath   *incpathnode;
+
+		incpathnode = makeNode(IncrementalSortPath);
+		pathnode = &incpathnode->spath;
+		pathnode->path.pathtype = T_IncrementalSort;
+		incpathnode->presortedCols = n_common_pathkeys;
+	}
+
+	Assert(n_common_pathkeys < list_length(pathkeys));
 
-	pathnode->path.pathtype = T_Sort;
 	pathnode->path.parent = rel;
 	/* Sort doesn't project, so use source path's pathtarget */
 	pathnode->path.pathtarget = subpath->pathtarget;
@@ -2626,12 +2720,9 @@ create_sort_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 
-	cost_sort(&pathnode->path, root, pathkeys,
-			  subpath->total_cost,
-			  subpath->rows,
-			  subpath->pathtarget->width,
-			  0.0,				/* XXX comparison_cost shouldn't be 0? */
-			  work_mem, limit_tuples);
+	pathnode->path.rows = subpath->rows;
+	pathnode->path.startup_cost = startup_cost;
+	pathnode->path.total_cost = startup_cost + run_cost;
 
 	return pathnode;
 }
@@ -2938,7 +3029,8 @@ create_groupingsets_path(PlannerInfo *root,
 			else
 			{
 				/* Account for cost of sort, but don't charge input cost again */
-				cost_sort(&sort_path, root, NIL,
+				cost_sort(&sort_path, root, NIL, 0,
+						  0.0,
 						  0.0,
 						  subpath->rows,
 						  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index ed36851fdd..a6e14af9b8 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 260ae264d8..5a2a983050 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -860,6 +860,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..83665e0fb2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the amount of disk
+								   space used, false when it's the amount of
+								   memory used */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +658,9 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
+
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +696,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +706,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +738,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +763,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +772,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +829,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +880,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +913,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1008,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1087,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1130,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1247,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1313,104 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax 
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* On-disk usage always dominates in-memory usage in this comparison. */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows avoiding recreation of the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2707,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2757,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3255,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff63d179b2..728e12ab82 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1870,6 +1870,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1898,6 +1912,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from outer node
+								   is finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index fce48026b6..d7cc21f446 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -242,6 +244,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a2dde70de5..5c207e7475 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1523,6 +1523,16 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..a037205219 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -106,8 +107,18 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 						 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   List *pathkeys, Cost input_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Cost *startup_cost, Cost *run_cost,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..26787a6221 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
@@ -229,6 +231,7 @@ extern List *make_inner_pathkeys_for_merge(PlannerInfo *root,
 extern List *trim_mergeclauses_for_inner_pathkeys(PlannerInfo *root,
 									 List *mergeclauses,
 									 List *pathkeys);
+extern int pathkeys_useful_for_ordering(List *query_pathkeys, List *pathkeys);
 extern List *truncate_useless_pathkeys(PlannerInfo *root,
 						  RelOptInfo *rel,
 						  List *pathkeys);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 20d6745730..9ea21c12b9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a08169f256..9ec9a66295 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,18 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#79Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#78)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On 04/06/2018 01:43 AM, Alexander Korotkov wrote:

Hi!

On Tue, Apr 3, 2018 at 2:10 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think solving this may be fairly straight-forward. Essentially, until
now we only had one way to do the sort, so it was OK to make the sort
implicit by checking if the path is sorted

    if (input not sorted)
    {
        ... add a Sort node ...
    }

But now we have multiple possible ways to do the sort, with different
startup/total costs. So the places that create the sorts need to
actually generate the Sort paths for each sort alternative, and store
the information in the Sort node (instead of relying on pathkeys).

Ultimately, this should simplify the createplan.c places making all the
make_sort calls unnecessary (i.e. the input should be already sorted
when needed). Otherwise it'd mean the decision needs to be done locally,
but I don't think that should be needed.

But it's surely a fairly invasive change to the patch ...
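
To make this concrete, here is a rough sketch (not code from the patch) of
what such a sort-creating site could look like once both alternatives are
generated as paths and add_path() is left to keep the cheaper one. The
helpers create_full_sort_path() and create_incremental_sort_path() are
hypothetical stand-ins for a split-up create_sort_path(), and root, rel,
subpath and limit_tuples are the usual planner locals:

    /*
     * Sketch only: generate both sort alternatives as paths and let
     * add_path() decide, instead of adding an implicit Sort node in
     * createplan.c.  The two create_*_sort_path() helpers are assumed,
     * not functions from the patch.
     */
    if (!pathkeys_contained_in(required_pathkeys, subpath->pathkeys))
    {
        int     n_common = pathkeys_common(subpath->pathkeys,
                                           required_pathkeys);

        /* A full sort is always applicable. */
        add_path(rel, (Path *) create_full_sort_path(root, rel, subpath,
                                                     required_pathkeys,
                                                     limit_tuples));

        /* An incremental sort also needs a presorted prefix. */
        if (enable_incrementalsort && n_common > 0)
            add_path(rel, (Path *) create_incremental_sort_path(root, rel,
                                                                subpath,
                                                                required_pathkeys,
                                                                n_common,
                                                                limit_tuples));
    }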

Right, there are situations when incremental sort has a lower startup cost
but a higher total cost.  In order to find the lower cost, we ideally should
generate paths for both full sort and incremental sort.  However, that would
increase the total number of paths and could slow down planning.  Another
issue is that we don't always generate paths for sort.  And yes, it would be
rather invasive.  So, that doesn't look feasible to get into 11.

I agree that's probably not feasible for PG11, considering the freeze is
about 48h from now. Not necessarily because of the amount of code needed to
do that (it might be fairly small, actually) but because of the risk of
regressions in other types of plans and lack of time for review/testing.

I do not think this would cause a significant increase in the number of
paths. We already do have the (partially sorted) paths in the pathlist,
otherwise v21 wouldn't be able to build the incremental sort path anyway.
And the places that made the decision in createplan.c could fairly easily
make it when constructing the path, I believe.

Looking v21, this affects three different places:

1) create_merge_append_plan

For merge_append the issue is that generate_mergeappend_paths() calls
get_cheapest_path_for_pathkeys(), which however only looks at cheapest
startup/total path, and then simply falls back to cheapest_total_path if
there are no suitably sorted paths. IMHO if you modify this to also
consider partially-sorted paths, it should work. You'll have to account
for the extra cost of the incremental sort, and it needs to be fairly cheap
(perhaps even by first quickly computing some initial cost estimate -
see e.g. initial_cost_nestloop/final_cost_nestloop).
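
As a sketch of such a quick check, reusing the cost_incremental_sort()
function this patch introduces, the loop over child paths could estimate
the incremental-sort cost of a partially sorted child before deciding
whether it beats the current best; best_total and best_total_cost are
assumed local bookkeeping, not patch code:

    /* Quick cost estimate for a partially sorted child path (sketch). */
    Cost    startup_cost;
    Cost    run_cost;
    int     n_presorted = pathkeys_common(subpath->pathkeys, pathkeys);

    cost_incremental_sort(&startup_cost, &run_cost,
                          root, pathkeys, n_presorted,
                          subpath->startup_cost,
                          subpath->total_cost,
                          subpath->parent->tuples,
                          subpath->pathtarget->width,
                          0.0,        /* comparison_cost */
                          work_mem, -1.0);    /* no limit */

    if (startup_cost + run_cost < best_total_cost)
    {
        /* Sorting this child incrementally wins on total cost. */
        best_total = subpath;
        best_total_cost = startup_cost + run_cost;
    }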

2) create_mergejoin_plan

For mergejoin, the issue is that sort_inner_and_outer() only looks at
cheapest_total_path for both sides, even before computing the merge
pathkeys. So some of the code would have to move into the foreach loop,
and the paths would be picked by get_cheapest_path_for_pathkeys(). This
time only using total cost, per the comment in sort_inner_and_outer().

3) create_gather_merge_plan

This seems fairly simple - the issue is that gather_grouping_paths()
only looks at cheapest_partial_path. Should be a matter of simply
calling the improved get_cheapest_path_for_pathkeys().

Of course, this is all fairly hand-wavy and it may turn out to be fairly
expensive in some cases. But we can use another trick - we don't need to
search through all partially sorted paths, because for each pathkey
prefix there can be just one "best" path for startup cost and one for
total cost. So we could maintain a much shorter list of partially sorted
paths, I think.
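
For illustration, the bookkeeping for that shorter list could be as simple
as one entry per presorted-prefix length; this structure is hypothetical,
not part of any posted patch:

    /*
     * One entry per length of presorted prefix, so the list never grows
     * beyond list_length(pathkeys) entries and is cheap to scan.
     */
    typedef struct PresortedPathEntry
    {
        int     n_presorted;        /* length of the sorted prefix */
        Path   *cheapest_startup;   /* best startup cost for this prefix */
        Path   *cheapest_total;     /* best total cost for this prefix */
    } PresortedPathEntry;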

Instead, I decided to cut the usage of incremental sort.  Now, incremental
sort is generated only in create_sort_path().  The cheaper path is selected
between incremental sort and full sort, taking limit_tuples into account.
That limits the usage of incremental sort, but the risk of regression from
this patch is also minimal.  In fact, incremental sort will be used only
when a sort is explicitly specified and either LIMIT is also specified, or
the dataset to be sorted is large and incremental sort saves disk IO.
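
Condensed, the decision rule in create_sort_path() in that revision is a
fraction-weighted cost comparison; declarations of the cost variables are
omitted here, since they are filled in by cost_incremental_sort() and
cost_full_sort() as in the patch:

    double  fraction;
    bool    use_incremental;

    /* Fraction of input tuples a LIMIT will fetch; 1.0 means no LIMIT. */
    fraction = limit_tuples / subpath->rows;
    if (fraction <= 0.0 || fraction >= 1.0)
        fraction = 1.0;

    /* Pick the strategy that is cheaper for that fraction of the output. */
    use_incremental =
        incremental_startup_cost + incremental_run_cost * fraction <
        full_startup_cost + full_run_cost * fraction;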

I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

It's pretty obvious it may be extremely useful for the other cases too
(particularly for mergejoin on large tables, where it can allow
in-memory sort with quick startup).

But even if you managed to make the necessary code changes, it's
unlikely any experienced committer will look into such significant
change this close to the cutoff. Either they are going to be away for a
weekend, or they are already looking at other patches :-(

Attached patch also incorporates following commits made by Alexander
Kuzmenkov:
* Rename fields of IncrementalSortState to snake_case for the sake of
consistency.
* Rename group test function to isCurrentGroup.
* Comments from Tomas Vondra about nodeIncrementalSort.c
* Add a test for incremental sort.
* Add a separate function to calculate costs of incremental sort.

Those changes seem fine, but there are still a couple of issues remaining:

1) pathkeys_useful_for_ordering() still uses enable_incrementalsort,
which I think is a bad idea. I've complained about it in my review on
31/3, and I don't see any explanation why this is a good idea.

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

* XXX The claim about abbreviated keys seems rather dubious, IMHO.

3) There is a comment at cost_merge_append, despite there being no
relevant changes in that function. Misplaced comment?

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

5) I do get this warning when building the code (a fix sketch follows this
list):

costsize.c: In function ‘cost_incremental_sort’:
costsize.c:1812:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
List *presortedExprs = NIL;
^~~~

6) The comment at cost_incremental_sort talks about cost_sort_internal,
but it's cost_sort_tuplesort I guess.

7) The new code in create_sort_path is somewhat ugly, I guess. It's
correct, but it really needs to be made easier to comprehend. I might
have time to look into that tomorrow, but I can't promise that.
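
For item 5, the usual fix is to move the declaration ahead of the first
statement in the block, roughly as below; the parameter list is elided and
the extra ListCell is only for illustration:

    static void
    cost_incremental_sort(/* ... parameters as in the patch ... */)
    {
        List       *presortedExprs = NIL;   /* declared before any statement */
        ListCell   *l;

        /* ... statements follow all the declarations, per ISO C90 ... */
    }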

Attached is a diff highlighting some of those places, and a couple of
minor code formatting fixes.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

incremental-sort-v22-review.difftext/x-patch; name=incremental-sort-v22-review.diffDownload
commit 7603a0d1bdcf1ebead6f4671c8e2db96436c86ef
Author: Tomas Vondra <tomas@2ndquadrant.com>
Date:   Fri Apr 6 16:10:56 2018 +0200

    review

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8d39b5c..2668cf3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1564,14 +1564,12 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
-	{
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
 									 0,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
 									 gm_plan->nullsFirst);
-	}
 
 	/* Now insert the subplan under GatherMerge. */
 	gm_plan->plan.lefttree = subplan;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6a595c3..6ff98b0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4817,6 +4817,8 @@ create_distinct_paths(PlannerInfo *root,
  * The only new path we need consider is an explicit sort on the
  * cheapest-total existing path.
  *
+ * XXX This comment needs updating, I guess.
+ *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
@@ -4856,12 +4858,24 @@ create_ordered_paths(PlannerInfo *root,
 	{
 		Path	   *path = (Path *) lfirst(lc);
 		int			n_useful_pathkeys;
+		bool		is_partially_sorted;
+		bool		is_sorted;
 
 		n_useful_pathkeys = pathkeys_useful_for_ordering(root->sort_pathkeys,
 														 path->pathkeys);
-		if (path == cheapest_input_path || n_useful_pathkeys > 0)
+
+		/*
+		 * The path is considered partially sorted when it's sorted by
+		 * a prefix of the pathkeys (possibly all of them).
+		 */
+		is_partially_sorted = (n_useful_pathkeys > 0);
+
+		/* It's fully sorted when it's sorted by all requested keys. */
+		is_sorted = (n_useful_pathkeys == list_length(root->sort_pathkeys));
+
+		if (path == cheapest_input_path || is_partially_sorted)
 		{
-			if (n_useful_pathkeys < list_length(root->sort_pathkeys))
+			if (!is_sorted)
 			{
 				/* An explicit sort here can take advantage of LIMIT */
 				path = (Path *) create_sort_path(root,
@@ -6235,17 +6249,28 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		{
 			Path	   *path = (Path *) lfirst(lc);
 			int			n_useful_pathkeys;
+			bool		is_partially_sorted;
+			bool		is_sorted;
 
 			n_useful_pathkeys = pathkeys_useful_for_ordering(
 									root->group_pathkeys, path->pathkeys);
-			if (path == cheapest_path || n_useful_pathkeys > 0)
+			/*
+			 * The path is considered partially sorted when it's sorted by
+			 * a prefix of the pathkeys (possibly all of them).
+			 */
+			is_partially_sorted = (n_useful_pathkeys > 0);
+
+			/* It's fully sorted when it's sorted by all requested keys. */
+			is_sorted = (n_useful_pathkeys == list_length(root->group_pathkeys));
+
+			if (path == cheapest_path || is_partially_sorted)
 			{
 				/*
 				 * Sort the path if it isn't already sorted.  Sort might
 				 * be needed for cheapest-total or path sorted by prefix
 				 * of required pathkeys.
 				 */
-				if (n_useful_pathkeys < list_length(root->group_pathkeys))
+				if (!is_sorted)
 					path = (Path *) create_sort_path(root,
 													 grouped_rel,
 													 path,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 83665e0..f8d105b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -661,7 +661,6 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 static void tuplesort_free(Tuplesortstate *state);
 static void tuplesort_updatemax(Tuplesortstate *state);
 
-
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
  * any variant of SortTuples, using the appropriate comparetup function.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 5c207e7..815c567 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1532,7 +1532,6 @@ typedef struct IncrementalSortPath
 	int			presortedCols;	/* number of presorted columns */
 } IncrementalSortPath;
 
-
 /*
  * GroupPath represents grouping (of presorted input)
  *
#80Alexander Kuzmenkov
a.kuzmenkov@postgrespro.ru
In reply to: Tomas Vondra (#79)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

Hi all,

This is the other Alexander K. speaking.

On 06.04.2018 20:26, Tomas Vondra wrote:

I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

I think we can reduce it even further. Just try incremental sort along
with full sort over the cheapest path in create_ordered_paths, and don't
touch anything else. This is a minimal and probably safe start,
and then we can continue working on other, more complex cases. In the
attached patch I tried to do this. We probably should also remove
changes in make_sort() and create a separate function
make_incremental_sort() for it, but I'm done for today.
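
As a sketch of that split, make_sort() could keep its old signature while a
hypothetical make_incremental_sort() carries the extra argument, mirroring
the IncrementalSort node added by the patch (this function is not in the
attached v23 patch):

    static IncrementalSort *
    make_incremental_sort(Plan *lefttree, int numCols, int presortedCols,
                          AttrNumber *sortColIdx, Oid *sortOperators,
                          Oid *collations, bool *nullsFirst)
    {
        IncrementalSort *node = makeNode(IncrementalSort);
        Sort       *sort = &node->sort;
        Plan       *plan = &sort->plan;

        /* Sort doesn't project, so just pass the child's targetlist up. */
        plan->targetlist = lefttree->targetlist;
        plan->qual = NIL;
        plan->lefttree = lefttree;
        plan->righttree = NULL;

        sort->numCols = numCols;
        sort->sortColIdx = sortColIdx;
        sort->sortOperators = sortOperators;
        sort->collations = collations;
        sort->nullsFirst = nullsFirst;

        /* The only extra piece relative to a plain Sort node. */
        node->presortedCols = presortedCols;

        return node;
    }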

1) pathkeys_useful_for_ordering() still uses enable_incrementalsort,
which I think is a bad idea. I've complained about it in my review on
31/3, and I don't see any explanation why this is a good idea.

Removed.

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

* XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first
version? Later we can research this more closely and maybe start
deciding whether to use abbrev at the planning stage.

3) There is a comment at cost_merge_append, despite there being no
relevant changes in that function. Misplaced comment?

Removed.

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

Not sure myself; let's ask the other Alexander.

5) I do get this warning when building the code:

costsize.c: In function ‘cost_incremental_sort’:
costsize.c:1812:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
List *presortedExprs = NIL;
^~~~

6) The comment at cost_incremental_sort talks about cost_sort_internal,
but it's cost_sort_tuplesort I guess.

Fixed.

7) The new code in create_sort_path is somewhat ugly, I guess. It's
correct, but it really needs to be made easier to comprehend. I might
have time to look into that tomorrow, but I can't promise that.

Removed this code altogether; now the costs are compared by add_path as
usual.

Attached is a diff highlighting some of those places, and a couple of
minor code formatting fixes.

Applied.

Also some other changes from me:

* Remove extra blank lines.
* label_sort_with_costsize shouldn't have to deal with IncrementalSort
plans, because they are only created from corresponding Path nodes.
* Reword a comment in ExecSupportsBackwardsScan.
* Clarify cost calculations.
* enable_incrementalsort is checked at path level, so we don't have to
check it again at plan level.
* enable_sort should act as a cost-based soft disable for both incremental
and normal sort (a sketch of that convention follows below).

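The last item follows the usual costsize.c convention for a soft disable,
roughly:

    /*
     * Cost-based soft disable: the sort path is still generated, it is
     * just priced out of consideration unless no other plan exists.
     */
    if (!enable_sort)
        startup_cost += disable_cost;
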
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-23.patchtext/x-patch; name=incremental-sort-23.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index fa0d1db5fb..2c0c6c3768 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization
+-- is a top-N sort.  But it can't be done on the remote side, because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the given order without a full sort, using an index scan
+-- plus incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index cf32be4bfe..96c9eb7ea6 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization
+-- is a top-N sort.  But it can't be done on the remote side, because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down, the
+-- CROSS JOIN is also not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the given order without a full sort, using an index scan
+-- plus incremental sort.  This is much cheaper than a full sort on the local
+-- side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,13 +2019,30 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
 /*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
+/*
  * Likewise, for a MergeAppend node.
  */
 static void
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
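+
+	/*
+	 * For an incremental sort, also list the prefix keys by which the
+	 * input is already sorted; EXPLAIN then shows a "Presorted Key" line
+	 * in addition to "Sort Key".
+	 */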
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2442,6 +2477,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 }
 
 /*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
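+	/*
+	 * In text format this produces lines like
+	 *   Sort Method: quicksort  Memory: 25kB
+	 *   Sort Groups: 4
+	 * plus a per-worker variant of the same for parallel workers.
+	 */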
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
+/*
  * Show information on hash buckets/batches.
  */
 static void
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..520aeefd83 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 1b1334006f..77013909a8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -373,7 +373,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, false, false);
 	}
 
 	aggstate->current_phase = newphase;
@@ -460,7 +460,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, false, false);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..5f28a3a5ea
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,673 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y), already presorted by X, and we need to sort
+ *		them by both X and Y.  Suppose the input tuples are the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm splits the input into the following
+ *		groups, each having equal X, and then sorts them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting each group and concatenating the results, we get the
+ *		following output, sorted by both X and Y as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely to fit into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
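+	 *
+	 * For example, with presorted keys (year, month, day), the day column
+	 * changes most often, so comparing it first usually detects a group
+	 * change with a single function call.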
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least MIN_GROUP_SIZE tuples.
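+ *
+ * For example, if the presorted prefix changes every couple of tuples, we
+ * still accumulate at least MIN_GROUP_SIZE tuples (spanning several small
+ * prefix groups) and sort them together in a single tuplesort run before
+ * looking for a group boundary.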
+ */
+#define MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming the outer subtree returns tuples presorted by some prefix
+ *		of the target sort columns, perform an incremental sort.  We fetch
+ *		groups of tuples whose prefix sort columns are equal and sort each
+ *		group using tuplesort.  This avoids sorting the whole dataset at
+ *		once; besides taking less memory and being faster, it lets us start
+ *		returning tuples before the full dataset has been fetched from the
+ *		outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the current sorted group if one is
+	 * available.  Otherwise we need to fetch more tuples from the input
+	 * and build another group.
+	 */
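+
+	/*
+	 * Note that when the tuplesort is exhausted and we have also reached
+	 * the end of the input (node->finished), we return the empty slot,
+	 * signaling that we are done.
+	 */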
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for comparing the presorted
+		 * (already sorted) columns; see preparePresortedCols().
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass it groups of at
+		 * least MIN_GROUP_SIZE tuples, so the groups don't necessarily
+		 * share the same value of the first column.  Incremental sort is
+		 * unlikely to produce huge groups, so using abbreviated keys would
+		 * likely be a waste of time.
+		 *
+		 * XXX The claim about abbreviated keys seems rather dubious, IMHO.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false,
+									true);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate the remaining bound for bounded sort.  For example, if
+	 * the bound is 100 and previous groups have already sorted 40 tuples,
+	 * the current group needs a bound of only 60.
+	 */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Feed the next group of tuples, i.e. tuples whose presortedCols sort
+	 * values are all equal, to tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least MIN_GROUP_SIZE tuples, and only
+		 * then do we start comparing the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < MIN_GROUP_SIZE)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == MIN_GROUP_SIZE - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least MIN_GROUP_SIZE tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group.  Only then
+			 * can we sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group. We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (here it is merely a leftover for the next batch, not a pivot).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we hold only the current group in the
+	 * tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort keeps only the current group in the tuplesort, so
+	 * unlike plain Sort we can never simply rewind and rescan the sorted
+	 * output.  We always forget the previous sort results and have to
+	 * re-read the subplan and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..457e774b3d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,9 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess,
+											  false);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d2e4aa3c2f..01cd7eea61 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -925,6 +925,24 @@ _copyMaterial(const Material *from)
 
 
 /*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
+/*
  * _copySort
  */
 static Sort *
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4883,6 +4917,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a6a1c16164..829d06090d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -894,12 +894,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -922,6 +920,24 @@ _outSort(StringInfo str, const Sort *node)
 }
 
 static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
+static void
 _outUnique(StringInfo str, const Unique *node)
 {
 	int			i;
@@ -3793,6 +3809,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 37e3568595..9516967fc4 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2108,12 +2108,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2122,6 +2123,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2693,6 +2720,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c4e4db15a6..ae68595e1b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3667,6 +3667,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..f6d4bec556 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ *	  Determines and returns the cost of sorting a relation, including the
+ *	  cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines and returns the cost of sorting a relation incrementally,
+ *	  when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
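+ *
+ * For example, with 1M input tuples estimated to fall into 1000 groups of
+ * equal presorted keys, we cost one tuplesort of 1.5 * 1000 tuples (the 1.5
+ * factor is a pessimistic allowance for unevenly sized groups) and charge
+ * it once per group we actually have to produce, as capped by limit_tuples.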
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples with equal
+	 * presorted keys.  Incremental sort is sensitive to the distribution
+	 * of tuples among groups, and we rely on quite rough assumptions here,
+	 * so be pessimistic and inflate the estimated average group size by
+	 * 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead of its own.  First, it has to
+	 * detect the sort group boundaries, which costs roughly one extra copy
+	 * and comparison per tuple.  Second, it has to reset the tuplesort
+	 * context for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..6b2ba366c9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -327,6 +327,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
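+ *
+ *    For example, if keys1 is (a, b, c) and keys2 is (a, b), *n_common is
+ *    set to 2 and false is returned, since keys1 is not contained in keys2.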
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..c20c7c545d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -242,7 +242,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype, bool inner_unique,
 			   bool skip_mark_restore);
-static Sort *make_sort(Plan *lefttree, int numCols,
+static Sort *make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -258,7 +258,7 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   TargetEntry *tle,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids);
+						Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -454,6 +454,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											   (GatherPath *) best_path);
 			break;
 		case T_Sort:
+		case T_IncrementalSort:
 			plan = (Plan *) create_sort_plan(root,
 											 (SortPath *) best_path,
 											 flags);
@@ -1183,7 +1184,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
 		{
-			Sort	   *sort = make_sort(subplan, numsortkeys,
+			Sort	   *sort = make_sort(subplan, numsortkeys, 0,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst);
 
@@ -1564,6 +1565,7 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path)
 	/* Now, insert a Sort node if subplan isn't sufficiently ordered */
 	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
 		subplan = (Plan *) make_sort(subplan, gm_plan->numCols,
+									 0,
 									 gm_plan->sortColIdx,
 									 gm_plan->sortOperators,
 									 gm_plan->collations,
@@ -1717,6 +1719,7 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 {
 	Sort	   *plan;
 	Plan	   *subplan;
+	int			n_common_pathkeys;
 
 	/*
 	 * We don't want any excess columns in the sorted tuples, so request a
@@ -1726,6 +1729,11 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	subplan = create_plan_recurse(root, best_path->subpath,
 								  flags | CP_SMALL_TLIST);
 
+	if (IsA(best_path, IncrementalSortPath))
+		n_common_pathkeys = ((IncrementalSortPath *) best_path)->presortedCols;
+	else
+		n_common_pathkeys = 0;
+
 	/*
 	 * make_sort_from_pathkeys() indirectly calls find_ec_member_for_tle(),
 	 * which will ignore any child EC members that don't belong to the given
@@ -1734,7 +1742,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	 */
 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
 								   IS_OTHER_REL(best_path->subpath->parent) ?
-								   best_path->path.parent->relids : NULL);
+								   best_path->path.parent->relids : NULL,
+								   n_common_pathkeys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -3932,7 +3941,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		Relids		outer_relids = outer_path->parent->relids;
 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
 												   best_path->outersortkeys,
-												   outer_relids);
+												   outer_relids,
+												   0);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		outer_plan = (Plan *) sort;
@@ -3946,7 +3956,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		Relids		inner_relids = inner_path->parent->relids;
 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
 												   best_path->innersortkeys,
-												   inner_relids);
+												   inner_relids,
+												   0);
 
 		label_sort_with_costsize(root, sort, -1.0);
 		inner_plan = (Plan *) sort;
@@ -5000,17 +5011,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
 
-	cost_sort(&sort_path, root, NIL,
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5593,13 +5611,25 @@ make_mergejoin(List *tlist,
  * nullsFirst arrays already.
  */
 static Sort *
-make_sort(Plan *lefttree, int numCols,
+make_sort(Plan *lefttree, int numCols, int presortedCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	if (presortedCols == 0)
+		node = makeNode(Sort);
+	else
+	{
+		IncrementalSort    *incrementalSort;
+
+		incrementalSort = makeNode(IncrementalSort);
+		node = &incrementalSort->sort;
+		incrementalSort->presortedCols = presortedCols;
+	}
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5932,9 +5962,11 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
  */
 static Sort *
-make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
+make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -5954,7 +5986,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, presortedCols,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -5997,7 +6029,7 @@ make_sort_from_sortclauses(List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6051,7 +6083,7 @@ make_sort_from_groupcols(List *groupcls,
 		numsortkeys++;
 	}
 
-	return make_sort(lefttree, numsortkeys,
+	return make_sort(lefttree, numsortkeys, 0,
 					 sortColIdx, sortOperators,
 					 collations, nullsFirst);
 }
@@ -6723,6 +6755,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 15c8d34c70..a022d0e85d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4814,8 +4814,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4854,29 +4854,58 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
-			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
-			}
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
+			add_path(ordered_rel, sorted_path);
+		}
+		else if (input_path == cheapest_input_path)
+		{
+			/*
+			 * Sort the cheapest input path. An explicit sort here can take
+			 * advantage of LIMIT.
+			 */
+			sorted_path = (Path *) create_sort_path(root,
+													ordered_rel,
+													input_path,
+													root->sort_pathkeys,
+													limit_tuples);
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
+
+			/* Also consider incremental sort. */
+			if (presorted_keys > 0)
+			{
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..dfee78c43e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2594,6 +2594,57 @@ create_set_projection_path(PlannerInfo *root,
 }
 
 /*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
+/*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
  *
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index ed36851fdd..a6e14af9b8 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -295,7 +295,8 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   qstate->rescan_needed,
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 71c2b4eff1..060790198a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -874,6 +874,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
 			NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..f8d105b564 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,9 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +246,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum space used by any one group's
+								   sort, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is an amount of
+								   on-disk space, false when it's an
+								   amount of in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuplesort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +658,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +695,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,14 +705,22 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
 	 * A dedicated child context used exclusively for caller passed tuples
@@ -715,7 +737,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +762,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +771,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -807,14 +828,15 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate,
+					 bool randomAccess, bool skipAbbrev)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -857,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0);
+		sortKey->abbreviate = (i == 0) && !skipAbbrev;
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -890,7 +912,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1007,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1086,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1129,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1246,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_free
  *
- *	Release resources and clean up.
- *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1312,104 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/* Track the high-water mark; on-disk usage dominates in-memory usage. */
+	if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2706,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2756,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3254,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff63d179b2..728e12ab82 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1870,6 +1870,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1898,6 +1912,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index b1e3d53f78..e83965215b 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -242,6 +244,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a2dde70de5..815c567199 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1524,6 +1524,15 @@ typedef struct SortPath
 } SortPath;
 
 /*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
+/*
  * GroupPath represents grouping (of presorted input)
  *
  * groupClause represents the columns to be grouped on; the input path
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..13b1c80632 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -109,6 +110,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 895bf6959d..72da4cec08 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -170,6 +170,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 						   RelOptInfo *rel,
 						   Path *subpath,
 						   PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 				 RelOptInfo *rel,
 				 Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..3285a8055b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..eb260dfd8b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -193,7 +193,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
 					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 bool randomAccess, bool skipAbbrev);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 20d6745730..9ea21c12b9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a08169f256..9ec9a66295 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,18 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#81Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Kuzmenkov (#80)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:

On 06.04.2018 20:26, Tomas Vondra wrote:

I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it will negatively affect other cases (at least I can't think
of any).

I think we can reduce it even further. Just try incremental sort along
with full sort over the cheapest path in create_ordered_paths, and don't
touch anything else. This is a very minimal and probably safe start, and
then we can continue working on other, more complex cases. In the attached
patch I tried to do this. We probably should also remove changes in
make_sort() and create a separate function make_incremental_sort() for it,
but I'm done for today.
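
For reference, the create_ordered_paths() change in the attached patch
boils down to the following (a condensed sketch; projection handling and
the plain full-sort branch are omitted):

    int         presorted_keys;
    bool        is_sorted;

    /* How many leading sort keys does the input already satisfy? */
    is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
                                             input_path->pathkeys,
                                             &presorted_keys);

    if (!is_sorted && input_path == cheapest_input_path &&
        presorted_keys > 0)
    {
        /* Consider an incremental sort in addition to the full sort. */
        sorted_path = (Path *) create_incremental_sort_path(root,
                                                            ordered_rel,
                                                            input_path,
                                                            root->sort_pathkeys,
                                                            presorted_keys,
                                                            limit_tuples);
        add_path(ordered_rel, sorted_path);
    }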

I've further decoupled sort and incremental sort, providing each with a
separate function for plan creation.
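
A minimal sketch of what the separate creation function looks like, based
on the IncrementalSort node definition (the exact signature in the
attached patch may differ slightly):

    static IncrementalSort *
    make_incremental_sort(Plan *lefttree, int numCols, int presortedCols,
                          AttrNumber *sortColIdx, Oid *sortOperators,
                          Oid *collations, bool *nullsFirst)
    {
        IncrementalSort *node = makeNode(IncrementalSort);
        Plan            *plan = &node->sort.plan;

        plan->targetlist = lefttree->targetlist;    /* sort doesn't project */
        plan->qual = NIL;
        plan->lefttree = lefttree;
        plan->righttree = NULL;

        node->sort.numCols = numCols;
        node->sort.sortColIdx = sortColIdx;
        node->sort.sortOperators = sortOperators;
        node->sort.collations = collations;
        node->sort.nullsFirst = nullsFirst;
        node->presortedCols = presortedCols;    /* length of sorted prefix */

        return node;
    }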

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

* XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first
version? Later we can research this more closely and maybe start deciding
whether to use abbrev at the planning stage.

That comes from the time when we were trying to make incremental sort always
no worse than full sort. Now we have separate paths for full and incremental
sorts, plus a costing penalty for incremental sort, so incremental sort should
be selected only when it's expected to give a big win. Thus, we can give up on
this optimization, at least in the initial version.

So, removed.
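
For the record, this is implemented as a new skipAbbrev flag on
tuplesort_begin_heap(): abbreviation stays enabled for plain sorts and is
suppressed for the small per-group sorts of incremental sort. A sketch of
how the executor node would request that (the call site shown here is
illustrative, since nodeIncrementalSort.c isn't quoted above):

    /* inside tuplesort_begin_heap(), per sort key */
    sortKey->abbreviate = (i == 0) && !skipAbbrev;

    /* an incremental sort node opting out of abbreviation */
    tuplesortstate = tuplesort_begin_heap(tupDesc,
                                          plannode->sort.numCols,
                                          plannode->sort.sortColIdx,
                                          plannode->sort.sortOperators,
                                          plannode->sort.collations,
                                          plannode->sort.nullsFirst,
                                          work_mem,
                                          NULL,   /* no coordination */
                                          false,  /* randomAccess */
                                          true);  /* skipAbbrev */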

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

Not sure myself, let's ask the other Alexander.

I've added a comment to INITIAL_MEMTUPSIZE. To be fair, though, it's not an
invention of this patch: the initial size of the memtuples array was the same
before. All this patch does is move it into a macro.
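
The macro with its new comment now reads roughly as follows (the comment
wording here is paraphrased; see grow_memtuples() for the details):

    /*
     * Initial size of the memtuples array.  It must be more than
     * ALLOCSET_SEPARATE_THRESHOLD; see the comments in grow_memtuples().
     * The value itself is unchanged from before this patch; it is only
     * factored into a macro so that tuplesort_reset() can reuse it.
     */
    #define INITIAL_MEMTUPSIZE Max(1024, \
        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)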

Also, this note hadn't been addressed yet.

On Sat, Mar 31, 2018 at 11:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm wondering if a static MIN_GROUP_SIZE is a good idea. For example, what
if the subplan is expected to return only very few tuples (say, 33), but
the query includes LIMIT 1. Now, let's assume the startup/total cost of
the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use a Min(limit, MIN_GROUP_SIZE) here, and perhaps
this should depend on average group size too.

I agree with that. For bounded sorts, the attached patch now selects the
minimal group size as Min(DEFAULT_MIN_GROUP_SIZE, bound). That should improve
the "LIMIT small_number" case.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-24.patchapplication/octet-stream; name=incremental-sort-24.patchDownload
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index fa0d1db5fb..2c0c6c3768 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Since the sort is not worth pushing down, the CROSS
+-- JOIN is not pushed down either, so that fewer tuples cross the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on
+-- the local side, even though the LIMIT is not known on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index cf32be4bfe..96c9eb7ea6 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- a top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Since the sort is not worth pushing down, the CROSS
+-- JOIN is not pushed down either, so that fewer tuples cross the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on
+-- the local side, even though the LIMIT is not known on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,12 +2019,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2441,6 +2476,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..520aeefd83 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..0fbb63d4b2
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
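+
+/*
+ * For illustration (hypothetical values): with presorted keys (a, b),
+ * pivot (1, 5) and incoming tuple (1, 6), we compare b first and return
+ * false after a single equality call, never touching a.  Comparing from
+ * the last key backwards pays off because the leading keys change least
+ * often in presorted input.
+ */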
+
+/*
+ * Sorting many small groups with tuplesort is inefficient.  To cope with
+ * this problem we don't start a new group until the current one contains
+ * at least DEFAULT_MIN_GROUP_SIZE tuples.  However, for a bounded sort
+ * whose bound is less than DEFAULT_MIN_GROUP_SIZE, we start looking for
+ * the next group once the bound has been reached.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
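+
+/*
+ * For example (a hypothetical illustration, not enforced anywhere): with a
+ * bounded sort of bound = 10, minGroupSize becomes Min(32, 10) = 10, so we
+ * accumulate 10 tuples, sort, and are done; with no bound we always collect
+ * at least 32 tuples before checking the prefix keys for a group boundary.
+ */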
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples with equal prefix sort columns and
+ *		sorts them using tuplesort.  This avoids sorting the whole dataset
+ *		at once.  Besides taking less memory and being faster, it lets us
+ *		start returning tuples before the full dataset has been fetched
+ *		from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures used by isCurrentGroup() to
+		 * compare the already-sorted prefix columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We feed it batches of at
+		 * least minGroupSize tuples, so the tuples in a batch don't
+		 * necessarily share equal values of the prefix columns.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Feed the next group of tuples, whose presortedCols sort values are
+	 * all equal, to tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group.  Only then
+			 * can we sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group.  We
+			 * need to carry the last tuple over to the next batch, so we
+			 * copy it into the pivot slot (where it's merely a leftover
+			 * tuple, not yet a pivot).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group of tuples in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Forget the previous sort results and re-read/re-sort the subplan.
+	 * Unlike plain sort, incremental sort holds only the current group in
+	 * the tuplesortstate, so there is no rewind-and-rescan shortcut here.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..bdab33f5c4 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d2e4aa3c2f..01cd7eea61 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4883,6 +4917,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a6a1c16164..829d06090d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -894,12 +894,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -921,6 +919,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3793,6 +3809,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 37e3568595..9516967fc4 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2108,12 +2108,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2122,6 +2123,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2693,6 +2720,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c4e4db15a6..ae68595e1b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3667,6 +3667,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..f6d4bec556 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples with equal
+	 * presorted keys.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, for which we rely on quite rough assumptions;
+	 * so we're pessimistic about its performance and inflate the estimated
+	 * average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
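+
+/*
+ * A worked example of the formulas above (the numbers are hypothetical):
+ * with input_tuples = 10000, input_groups = 100 and limit_tuples = 10, we
+ * get group_tuples = 100 and output_groups = floor(10 / 100) + 1 = 1.  The
+ * startup cost then covers only the input startup cost, 1/100th of the
+ * input run cost, and a single tuplesort of 1.5 * 100 tuples, which is why
+ * incremental sort can be dramatically cheaper than a full sort for LIMIT
+ * queries.
+ */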
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..6b2ba366c9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -327,6 +327,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
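+
+/*
+ * Example (purely illustrative): for keys1 = (a, b, c) and keys2 = (a, b),
+ * *n_common is set to 2 and false is returned, since keys1 is not fully
+ * contained in keys2; with the arguments swapped, *n_common is still 2 but
+ * true is returned.
+ */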
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..34b2417c4c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -96,6 +96,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 					   int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 						 int flags);
@@ -245,6 +247,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						   Relids relids,
 						   const AttrNumber *reqColIdx,
@@ -259,6 +265,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -458,6 +466,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1741,6 +1754,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Like create_sort_plan, but creates an IncrementalSort plan instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5000,17 +5039,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
 
-	cost_sort(&sort_path, root, NIL,
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5597,9 +5643,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5613,6 +5662,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5959,6 +6039,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6723,6 +6839,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 15c8d34c70..a022d0e85d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4814,8 +4814,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4854,29 +4854,58 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
-			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
-			}
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
+			add_path(ordered_rel, sorted_path);
+		}
+		else if (input_path == cheapest_input_path)
+		{
+			/*
+			 * Sort the cheapest input path. An explicit sort here can take
+			 * advantage of LIMIT.
+			 */
+			sorted_path = (Path *) create_sort_path(root,
+													ordered_rel,
+													input_path,
+													root->sort_pathkeys,
+													limit_tuples);
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
+
+			/* Also consider incremental sort. */
+			if (presorted_keys > 0)
+			{
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..dfee78c43e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2593,6 +2593,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 71c2b4eff1..060790198a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -873,6 +873,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..029c43b1d5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose the size so that the
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is as small as possible.  However, we don't consider array
+ * sizes of less than 1024 elements.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is value for on-disk
+								   space, false when it's value for in-memory
+								   space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1091,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1134,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1251,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1317,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to the disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount occupied by the same tuples in memory,
+	 * due to the more compact on-disk representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2718,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2768,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3266,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff63d179b2..728e12ab82 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1870,6 +1870,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call them "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1898,6 +1912,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* has fetching tuples from the outer
+								   node finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index b1e3d53f78..e83965215b 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -242,6 +244,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a2dde70de5..815c567199 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1523,6 +1523,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..13b1c80632 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -109,6 +110,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 895bf6959d..72da4cec08 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -170,6 +170,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 						   RelOptInfo *rel,
 						   Path *subpath,
 						   PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 				 RelOptInfo *rel,
 				 Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..3285a8055b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..4cad0d4fc2 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -192,8 +192,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 int workMem, SortCoordinate coordinate, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +239,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 20d6745730..9ea21c12b9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a08169f256..9ec9a66295 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,18 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#82Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alexander Korotkov (#81)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Sat, Apr 7, 2018 at 4:56 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a.kuzmenkov@postgrespro.ru> wrote:

On 06.04.2018 20:26, Tomas Vondra wrote:

I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

I think we can reduce it even further. Just try incremental sort along
with full sort over the cheapest path in create_ordered_paths, and don't
touch anything else. This is a very minimal and a probably safe start, and
then we can continue working on other, more complex cases. In the attached
patch I tried to do this. We probably should also remove changes in
make_sort() and create a separate function make_incremental_sort() for it,
but I'm done for today.

I've done further unwedding of sort and incremental sort, providing them
with separate functions for plan creation.
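
To illustrate the reduced scope (a sketch, reusing the query shape from the
patch's new incremental_sort regression test): only a top-level ORDER BY,
optionally with a LIMIT, can now get an incremental sort path, e.g.

    explain (costs off)
    select * from (select * from tenk1 order by four) t
    order by four, ten limit 1;

which is expected to produce a Limit over an Incremental Sort with
"Presorted Key: tenk1.four".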

2) Likewise, I've suggested that the claim about abbreviated keys in

nodeIncrementalsort.c is dubious. No response, and the XXX comment was

instead merged into the patch:

* XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first
version? Later we can research this more closely and maybe start deciding
whether to use abbrev at the planning stage.

That comes from the time when we were trying to make incremental sort
always no worse than full sort.  Now we have separate paths for full and
incremental sorts, and some costing penalty for incremental sort.  So
incremental sort should be selected only when it's expected to give a big
win.  Thus, we can give up on this optimization, at least in the initial
version.

So, removed.

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.

There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

Not sure myself, let's ask the other Alexander.

I've added a comment to INITIAL_MEMTUPSIZE.  However, to be fair, it's not
an invention of this patch: the initial size of the memtuples array was
the same previously.  What this patch does is just move it into a macro.
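
For concreteness (my arithmetic, not part of the patch; it assumes a
typical 64-bit build where ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes and
sizeof(SortTuple) is 24 bytes):

    select greatest(1024, 8192 / 24 + 1) as initial_memtupsize;  -- 1024

so on such builds the 1024 floor dominates, and the threshold term only
matters on platforms where SortTuple is much smaller.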

Also, this note hadn't been addressed yet.

On Sat, Mar 31, 2018 at 11:43 PM, Tomas Vondra <tomas.vondra@
2ndquadrant.com> wrote:

I'm wondering if a static MIN_GROUP_SIZE is a good idea. For example, what
if the subplan is expected to return only very few tuples (say, 33), but
the query includes LIMIT 1. Now, let's assume the startup/total cost of
the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use a Min(limit,MIN_GROUP_SIZE) here, and perhaps
this should depend on average group size too.

I agree with that.  For bounded sort, the attached patch now selects the
minimal group size as Min(DEFAULT_MIN_GROUP_SIZE, bound).  That should
improve the "LIMIT small_number" case.

I've just noticed that incremental sort is now not used in
contrib/postgres_fdw.  That's even better, assuming we're going to limit
the use cases of incremental sort.  I've rolled back all the changes this
patch made to the tests of contrib/postgres_fdw.  A revised version is
attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-25.patchapplication/octet-stream; name=incremental-sort-25.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,12 +2019,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2441,6 +2476,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..520aeefd83 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..0fbb63d4b2
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient.  In order to
+ * cope with this problem we don't start looking for a new group until the
+ * current one contains at least DEFAULT_MIN_GROUP_SIZE tuples.  However,
+ * for a bounded sort whose remaining bound is smaller than that, we use
+ * the remaining bound as the minimal group size instead.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are
+ *		equal and sorts them using tuplesort.  This approach avoids
+ *		sorting the whole dataset.  Besides taking less memory and being
+ *		faster, it allows returning tuples before the full dataset has
+ *		been fetched from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures for comparing the presorted
+		 * (i.e., already sorted) columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * minGroupSize tuples to tuplesort.  Thus, these groups don't
+		 * necessarily have equal values of the presorted columns.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Pass the next group of tuples, i.e. tuples with equal values of
+	 * the presorted columns, to the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for the tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then do we start comparing the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * subsequent tuples have the same prefix keys (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple of the minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while the presorted columns are the same as in the
+			 * pivot tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group.  Only then
+			 * can we do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same, the tuple belongs to the same group, so we
+			 * pass it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group.
+			 * We need to keep that tuple for the next batch, so we copy it
+			 * into the pivot slot (where it's just a leftover, not a pivot).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done by the number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support rewinding and rescanning its
+	 * sorted output, because we hold only the current group in the
+	 * tuplesortstate.
+	 *
+	 * So we always forget previous sort results and re-sort from scratch.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
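
To illustrate the batching strategy used by ExecIncrementalSort() above, here
is a toy standalone sketch.  The Row type, the tiny MIN_GROUP_SIZE, and the
qsort-based sort are invented for the illustration; the executor instead
feeds tuplesort and carries the first non-matching tuple over into the next
batch via the group_pivot slot.

/* sketch.c - illustration only, not part of the patch */
#include <stdio.h>
#include <stdlib.h>

#define MIN_GROUP_SIZE 4		/* the patch uses DEFAULT_MIN_GROUP_SIZE 32 */

typedef struct { int a; int b; } Row;	/* input comes presorted by a */

static int
cmp_ab(const void *p1, const void *p2)
{
	const Row *r1 = p1, *r2 = p2;

	if (r1->a != r2->a)
		return (r1->a > r2->a) - (r1->a < r2->a);
	return (r1->b > r2->b) - (r1->b < r2->b);
}

int
main(void)
{
	Row		input[] = {{1,9},{1,3},{1,7},{2,5},{2,1},{3,8},{3,2},{3,6},{4,4}};
	int		ninput = sizeof(input) / sizeof(input[0]);
	Row		batch[16];
	int		pos = 0;

	while (pos < ninput)
	{
		int		n = 0;
		Row		pivot;

		/* accumulate at least MIN_GROUP_SIZE tuples ... */
		while (pos < ninput && n < MIN_GROUP_SIZE)
			batch[n++] = input[pos++];
		pivot = batch[n - 1];

		/* ... then keep accumulating while the prefix key matches the pivot */
		while (pos < ninput && input[pos].a == pivot.a)
			batch[n++] = input[pos++];

		/* sort and emit the batch; the real code calls tuplesort_reset()
		 * here instead of starting a fresh sort */
		qsort(batch, n, sizeof(Row), cmp_ab);
		for (int i = 0; i < n; i++)
			printf("(%d,%d)\n", batch[i].a, batch[i].b);
	}
	return 0;
}

Running this prints the rows fully sorted by (a, b) even though each qsort
call only ever sees one small batch.
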
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..bdab33f5c4 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9287baaedc..9b117f7f05 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 03a91c3352..51d0e3008c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -894,12 +894,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -921,6 +919,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3806,6 +3822,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2812dc9646..ee730bd52d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2134,12 +2134,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2148,6 +2149,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2723,6 +2750,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 65a34a255d..b13f7a68ba 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3713,6 +3713,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..f6d4bec556 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+			linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples with equal
+	 * presorted keys.  Incremental sort is sensitive to the distribution
+	 * of tuples among the groups, and here we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate the expected group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
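
To make the effect of the group-based costing above concrete, here is a toy
calculation (the numbers are invented, not from the patch): with 1,000,000
input tuples split into 1,000 groups by the presorted keys and LIMIT 10,
we get

	group_tuples  = 1000000 / 1000 = 1000
	output_groups = floor(10 / 1000) + 1 = 1

so the startup cost covers reading and sorting roughly one group (1,500
tuples after the 1.5x pessimism factor), while a full Sort node would have
to sort all 1,000,000 tuples before returning the first row.
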
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..6b2ba366c9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -327,6 +327,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
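
The contract of pathkeys_common_contained_in() above is the interesting bit
for the planner changes below: it reports the longest common prefix even
when containment fails.  A minimal sketch of the same contract over plain
int arrays (an invented example, not PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

static bool
common_contained_in(const int *keys1, int n1,
					const int *keys2, int n2, int *n_common)
{
	int		n = 0;

	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	*n_common = n;
	return n == n1;				/* keys1 fully matched as a prefix of keys2 */
}

int
main(void)
{
	int		required[] = {1, 2, 3};	/* ORDER BY a, b, c */
	int		provided[] = {1, 2};	/* input sorted by a, b */
	int		n_common;
	bool	contained = common_contained_in(required, 3, provided, 2,
											&n_common);

	/* prints contained=0 n_common=2: not fully sorted, but two presorted
	 * keys are available for an incremental sort */
	printf("contained=%d n_common=%d\n", contained, n_common);
	return 0;
}
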
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..34b2417c4c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -96,6 +96,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 					   int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 						 int flags);
@@ -245,6 +247,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						   Relids relids,
 						   const AttrNumber *reqColIdx,
@@ -259,6 +265,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -458,6 +466,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1741,6 +1754,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5000,17 +5039,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
 
-	cost_sort(&sort_path, root, NIL,
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5597,9 +5643,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5613,6 +5662,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5959,6 +6039,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6723,6 +6839,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 008492bad5..b37c4ff933 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4840,8 +4840,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort
+ * and an incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4880,29 +4880,58 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
-			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
-			}
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
+			add_path(ordered_rel, sorted_path);
+		}
+		else if (input_path == cheapest_input_path)
+		{
+			/*
+			 * Sort the cheapest input path. An explicit sort here can take
+			 * advantage of LIMIT.
+			 */
+			sorted_path = (Path *) create_sort_path(root,
+													ordered_rel,
+													input_path,
+													root->sort_pathkeys,
+													limit_tuples);
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
+
+			/* Also consider incremental sort. */
+			if (presorted_keys > 0)
+			{
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..dfee78c43e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2593,6 +2593,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 71c2b4eff1..060790198a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -873,6 +873,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..029c43b1d5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the
+ * allocation overhead is as low as possible.  However, we don't consider
+ * array sizes smaller than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum space occupied among sorts of
+								   multiple batches, in memory or on disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for
+								   on-disk space, false when it's the value
+								   for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context that survives tuplesort_reset.  It holds data that is
+	 * useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1091,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1134,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1251,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1317,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit that data into
+	 * main memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount of space occupied by the same tuples
+	 * in memory, due to a more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort
+ *	(and thus save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2718,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2768,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3266,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
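
Putting the tuplesort changes above together, the intended usage pattern
from the executor looks roughly like this (a sketch, not code from the
patch; load_next_group() and emit() are hypothetical stand-ins for the
executor logic, and the other variables are assumed to be set up elsewhere):

	Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, numCols, sortColIdx,
											  sortOperators, collations,
											  nullsFirst, work_mem,
											  NULL, false);

	for (;;)
	{
		if (!load_next_group(ts))	/* feeds tuplesort_puttupleslot() */
			break;
		tuplesort_performsort(ts);
		while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
			emit(slot);
		tuplesort_reset(ts);		/* resets sortcontext data, keeps the
									 * maincontext metadata and memtuples */
	}
	tuplesort_end(ts);				/* deletes maincontext as well */

The split into maincontext and sortcontext is what makes tuplesort_reset()
cheap: per-batch tuple data lives in sortcontext, which is reset for each
group, while the sort keys and other metadata survive in maincontext.
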
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 538e679cdf..88f18e3701 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1876,6 +1876,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1904,6 +1918,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4fc2de7184..cf9e2e64f9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index acb8814924..75569203f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1534,6 +1534,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..13b1c80632 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -109,6 +110,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 895bf6959d..72da4cec08 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -170,6 +170,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 						   RelOptInfo *rel,
 						   Path *subpath,
 						   PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 				 RelOptInfo *rel,
 				 Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..3285a8055b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..4cad0d4fc2 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -192,8 +192,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 int workMem, SortCoordinate coordinate, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +239,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 00c324dd44..1e0d6f4b62 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 39c3fa9c85..c43abdf1fc 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,18 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#83Teodor Sigaev
teodor@sigaev.ru
In reply to: Alexander Korotkov (#82)
Re: [HACKERS] [PATCH] Incremental sort

by this patch.  Revised version is attached.

Fine, the patch got several rounds of review in all its parts. Are there
any places which should be improved before commit?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#84Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Teodor Sigaev (#83)
Re: [HACKERS] [PATCH] Incremental sort

On 04/07/2018 04:37 PM, Teodor Sigaev wrote:

by this patch.  Revised version is attached.

Fine, the patch got several rounds of review in all its parts. Are there
any places which should be improved before commit?

I personally feel rather uneasy about committing it, TBH.

While I don't see any obvious issues in the patch at the moment, the
recent changes were rather significant so I might easily miss some
unexpected consequences. (OTOH it's true it was mostly about reduction
of scope, to limit the risks.)

I don't have time to do more review and testing on the latest patch
version, unfortunately, certainly not before the CF end.

So I guess the ultimate review / decision is up to you ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#85Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alvaro Herrera (#70)
Re: [HACKERS] [PATCH] Incremental sort

On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Teodor Sigaev wrote:

BTW, patch had conflicts with master. Please find the rebased version attached.

Despite the patch conflict, the patch looks committable; has anybody
objections to commit it?

The patch received several rounds of review during 2 years, and it seems
to me that keeping it out of the sources may cause us to lose it, although
it offers a performance improvement in a rather wide range of use cases.

Can we have a recap on what the patch *does*? I see there's a
description in Alexander's first email
/messages/by-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
but that was a long time ago, and the patch has likely changed in the
meantime ...

The general idea hasn't changed much since the first email.
Incremental sort gives a benefit when you need to sort your dataset
by some list of columns while you already have the input presorted
by some prefix of that list of columns. Then you don't do a full sort
of the dataset, but rather sort the groups where the values of the
prefix columns are equal (see the header comment in nodeIncrementalSort.c).

The same example as in the first letter works, but the plan displays
differently.

create table test as (select id, (random()*10000)::int as v1, random() as
v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

# explain select * from test order by v1, v2 limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=1.26..1.26 rows=10 width=16)
   ->  Incremental Sort  (cost=1.26..1.42 rows=1000000 width=16)
         Sort Key: v1, v2
         Presorted Key: v1
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47602.50 rows=1000000 width=16)
(5 rows)

# select * from test order by v1, v2 limit 10;
   id   | v1 |         v2
--------+----+--------------------
 216426 |  0 | 0.0697950166650116
  96649 |  0 |  0.230586454737931
 892243 |  0 |  0.677791305817664
 323001 |  0 |  0.708638620562851
  87458 |  0 |  0.923310813494027
 224291 |  0 |    0.9349579163827
 446366 |  0 |  0.984529701061547
 376781 |  0 |  0.997424073051661
 768246 |  1 |  0.127851997036487
 666102 |  1 |   0.27093240711838
(10 rows)
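
For comparison, the new GUC introduced by this patch can be used to force
the planner back to a plain sort (a quick sketch on the same illustrative
test table; output omitted):

set enable_incrementalsort = off;
explain select * from test order by v1, v2 limit 10;
-- a plain top-N Sort over a scan of test is expected here instead
reset enable_incrementalsort;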

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#86Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#85)
Re: [HACKERS] [PATCH] Incremental sort

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Can we have a recap on what the patch *does*?

The general idea hasn't changed much since the first email.
Incremental sort gives a benefit when you need to sort your dataset
by some list of columns while you already have the input presorted
by some prefix of that list of columns. Then you don't do a full sort
of the dataset, but rather sort the groups where the values of the
prefix columns are equal (see the header comment in nodeIncrementalSort.c).

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.
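
(To make the statistics point concrete: the planner's guess at the
number and size of those groups ultimately rests on the per-column
ndistinct estimates, which are easy to inspect -- a sketch against the
toy table from upthread:

select attname, n_distinct from pg_stats
where tablename = 'test' and attname in ('v1', 'v2');

If ANALYZE's ndistinct estimate for v1 is far off, the incremental sort
costing will be off by a similar factor.)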

Given that we already have more than enough dubious patches that have
been shoved in in the last few days, I'd rather not pile on stuff that
there's any question about.

regards, tom lane

#87Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#86)
Re: [HACKERS] [PATCH] Incremental sort

On 2018-04-07 12:06:52 -0400, Tom Lane wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Can we have a recap on what the patch *does*?

The general idea hasn't changed much since the first email.
Incremental sort gives a benefit when you need to sort your dataset
by some list of columns while you already have the input presorted
by some prefix of that list of columns. Then you don't do a full sort
of the dataset, but rather sort the groups where the values of the
prefix columns are equal (see the header comment in nodeIncrementalSort.c).

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

Given that we already have more than enough dubious patches that have
been shoved in in the last few days, I'd rather not pile on stuff that
there's any question about.

I don't disagree with any of that. Just wanted to pipe up to say that
there's a fair argument to be made that this patch, which has lingered
for years, "deserves" more to mature in tree than some of the rest.

Greetings,

Andres Freund

#88Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#86)
Re: [HACKERS] [PATCH] Incremental sort

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

I think that improvement of the cost calculation for sort should be a
separate patch, not directly connected to this one. Postponing patches
until other parts are ready, to get the maximum improvement for the
postponed ones, doesn't seem very good to me, especially if the patch
offers some improvement right now.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#89Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Teodor Sigaev (#83)
1 attachment(s)
Re: [HACKERS] [PATCH] Incremental sort

On Sat, Apr 7, 2018 at 5:37 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

by this patch. Revised version is attached.

Fine, the patch got several rounds of review in all its parts. Are there
any places which should be improved before commit?

Also I found that after the planner changes by Alexander Kuzmenkov,
incremental sort was used in cheapest_input_path() only if its child is
the cheapest total path. That means incremental sort almost never gets
used. I've changed that to consider an incremental sort path whenever we
have some presorted columns. I also had to put the changes to the
postgres_fdw regression tests back, because incremental sort gets used
right there.
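
A quick way to check that the path is considered again (a sketch reusing
the test table and index from the example in #85; output omitted):

explain (costs off) select * from test order by v1, v2 limit 10;
-- an Incremental Sort over the index scan on test_v1_idx should now be
-- chosen, even though that index scan is not the cheapest total path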

This revision of the patch also includes a commit message.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

incremental-sort-26.patchapplication/octet-stream; name=incremental-sort-26.patchDownload
commit 6428245702a40b3e3fa11bb64b7611cdd33a0778
Author: Alexander Korotkov <a.korotkov@postgrespro.ru>
Date:   Sat Apr 7 18:51:20 2018 +0300

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases when the
    input is already sorted by a prefix of the sort keys.  For example when a sort
    by (key1, key2 ... keyN) is requested, and the input is already sorted by
    (key1, key2 ... keyM), M < N, we can divide the input into groups where keys
    (key1, ... keyM) are equal, and only sort on the remaining columns.
    
    Incremental sort can give a huge benefit when a LIMIT clause is specified,
    since then it doesn't even have to read the whole input.  Another big
    benefit of incremental sort is that sorting data in small groups may avoid
    spilling to disk during the sort.  However, on small datasets which fit
    into memory, incremental sort may be slightly slower than a full sort.
    That is reflected in the costing.

    This patch implements a very basic usage of incremental sort: it gets used
    only in create_ordered_paths(), while this kind of sort could help in many
    more cases, for instance in merge joins.  But that would require many more
    changes in the optimizer, and is postponed to future releases.

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index e4d9469fdd..61775e6726 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query the essential optimization is
+-- the top-N sort, but it can't be performed on the remote side because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down either,
+-- the CROSS JOIN is not pushed down, to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index e1df952e7a..05c8df8da9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query the essential optimization is
+-- the top-N sort, but it can't be performed on the remote side because we
+-- never push LIMIT down.  Assuming the sort is not worth pushing down either,
+-- the CROSS JOIN is not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus incremental sort.  This is much cheaper than a full sort on the
+-- local side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,12 +2019,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2441,6 +2476,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..520aeefd83 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..0fbb63d4b2
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples.  However, in the case
+ * of a bounded sort whose bound is less than DEFAULT_MIN_GROUP_SIZE, we
+ * start looking for the next group once the bound is exhausted.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are
+ *		equal and sorts them using tuplesort.  This approach avoids
+ *		sorting the whole dataset.  Besides taking less memory and being
+ *		faster, it allows us to start returning tuples before fetching
+ *		the full dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize support structures for cmpSortPresortedCols - already
+		 * sorted columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * minGroupSize tuples to tuplesort, so these groups don't
+		 * necessarily have equal values in the presorted columns.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, in which the presortedCols sort
+	 * values are all equal, into tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups are there in that set), we need to keep
+			 * accumulating until we reach the end of the group. Only then
+			 * we can do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group. We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (it does not serve as pivot, though).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type.  No need to initialize projection
+	 * info, because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support randomAccess and keeps only the
+	 * current group in the tuplesortstate, so we always forget previous
+	 * sort results, re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..bdab33f5c4 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9287baaedc..9b117f7f05 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 03a91c3352..51d0e3008c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -894,12 +894,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -921,6 +919,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3806,6 +3822,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2812dc9646..ee730bd52d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2134,12 +2134,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2148,6 +2149,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2723,6 +2750,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 65a34a255d..b13f7a68ba 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3713,6 +3713,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..f6d4bec556 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, excluding
+ *	  the cost of reading the input data, and returns it via *startup_cost
+ *	  and *run_cost.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ *	Determines the cost of sorting a relation, including the cost of
+ *	reading the input data, and returns it via *startup_cost and *run_cost.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	Determines and returns the cost of sorting a relation incrementally,
+ *	when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
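+	/*
+	 * Either enable_sort or enable_incrementalsort being off penalizes the
+	 * path below, since an incremental sort is still a sort.
+	 */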
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
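+	/*
+	 * Illustrative example: 10000 input tuples estimated to fall into 100
+	 * groups give group_tuples = 100, and each group is charged 1/100th of
+	 * the input's run cost.
+	 */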
+
+	/*
+	 * Estimate the average cost of sorting one group whose presorted keys
+	 * are all equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across the groups, and here we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate the average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is the cost to finish processing this group,
+	 * plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
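+	/*
+	 * Illustrative example: if a LIMIT means only 3 groups are ever needed,
+	 * the run cost covers finishing the first group plus the full startup,
+	 * run, and input costs of just the 2 remaining groups.
+	 */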
+
+	/*
+	 * Incremental sort adds some overhead by itself.  First, it has to
+	 * detect the sort groups, which costs roughly one extra copy and
+	 * comparison per tuple.  Second, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..3a3b2b6b14 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -327,6 +327,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
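+ *
+ *    For example (illustrative): with keys1 = (a, b) and keys2 = (a, c),
+ *    *n_common is set to 1 and false is returned; with keys1 = (a) and
+ *    keys2 = (a, b), *n_common is set to 1 and true is returned.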
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1587,19 +1632,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of pathkeys in common, or 0 if there are none.  Any
+	 * leading common pathkeys could be useful for ordering, because an
+	 * incremental sort can handle the remaining keys.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..34b2417c4c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -96,6 +96,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 					   int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 						 int flags);
@@ -245,6 +247,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						   Relids relids,
 						   const AttrNumber *reqColIdx,
@@ -259,6 +265,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -458,6 +466,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1741,6 +1754,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5000,17 +5039,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
 
-	cost_sort(&sort_path, root, NIL,
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5597,9 +5643,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5613,6 +5662,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5959,6 +6039,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an IncrementalSort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6723,6 +6839,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 008492bad5..aa3d97b77d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4840,8 +4840,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort
+ * on the cheapest-total existing path and incremental sorts on any
+ * paths with presorted keys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4880,29 +4880,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
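+				/*
+				 * Note that, unlike the full sort above (tried only on the
+				 * cheapest input path), incremental sort is considered for
+				 * every input path that shares at least one leading pathkey
+				 * with the requested ordering.
+				 */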
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..dfee78c43e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2593,6 +2593,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 71c2b4eff1..060790198a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -873,6 +873,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..029c43b1d5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array exceeds ALLOCSET_SEPARATE_THRESHOLD (see the comments in
+ * grow_memtuples()) while the allocation overhead stays as low as possible.
+ * However, we don't consider array sizes of less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied across
+								   all the group sorts, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is an amount of
+								   on-disk space, false when it's an amount
+								   of in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1091,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1134,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1251,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1317,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we treat space used on disk as more important
+	 * for tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less
+	 * than the amount occupied by the same tuples in memory, due to the
+	 * more compact representation.
+	 */
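+	/*
+	 * Illustrative consequence of the rule below: a sort that spilled 80MB
+	 * to disk outranks a sort that stayed within 120MB of memory.
+	 */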
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	so save resources) when sorting multiple small batches.
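+ *
+ *	A typical caller pattern (illustrative sketch) when sorting a series of
+ *	batches might look like:
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch:
+ *			feed tuples with tuplesort_puttupleslot();
+ *			tuplesort_performsort(state);
+ *			read the sorted tuples back;
+ *			tuplesort_reset(state);
+ *		tuplesort_end(state);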
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2718,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2768,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3266,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 538e679cdf..88f18e3701 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1876,6 +1876,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
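+ *
+ *	 For example (illustrative): for ORDER BY a, b over an input already
+ *	 sorted by a, there is a single presorted key, "a".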
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1904,6 +1918,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* has fetching tuples from the outer
+								   node finished? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4fc2de7184..cf9e2e64f9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index acb8814924..75569203f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1534,6 +1534,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..13b1c80632 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -109,6 +110,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 895bf6959d..72da4cec08 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -170,6 +170,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 						   RelOptInfo *rel,
 						   Path *subpath,
 						   PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 				 RelOptInfo *rel,
 				 Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..3285a8055b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..4cad0d4fc2 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -192,8 +192,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 int workMem, SortCoordinate coordinate, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +239,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -88,7 +89,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(16 rows)
+(17 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 00c324dd44..1e0d6f4b62 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index merge incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 39c3fa9c85..c43abdf1fc 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,18 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#88)
Re: [HACKERS] [PATCH] Incremental sort

Teodor Sigaev <teodor@sigaev.ru> writes:

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

I think that improvement in the cost calculation of sort should be a
separate patch, not directly connected to this one. Postponing a patch
until other parts are ready, just to get the maximum improvement for the
postponed one, doesn't seem very good to me, especially when it offers
some improvement right now.

No, you misunderstand the point of my argument. Without a reasonably
reliable cost model, this patch could easily make performance *worse*
not better for many people, due to choosing incremental-sort plans
where they were really a loss.

If we were at the start of a development cycle and work were being
promised to be done later in the cycle to improve the planning aspect,
I'd be more charitable about it. But this isn't merely the end of a
cycle, it's the *last day*. Now is not the time to commit stuff that
needs, or even just might need, follow-on work.

regards, tom lane

#91Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#90)
Re: [HACKERS] [PATCH] Incremental sort

On 04/07/2018 06:23 PM, Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

I think that improvement in the cost calculation of sort should be a
separate patch, not directly connected to this one. Postponing a patch
until other parts are ready, just to get the maximum improvement for the
postponed one, doesn't seem very good to me, especially when it offers
some improvement right now.

No, you misunderstand the point of my argument. Without a reasonably
reliable cost model, this patch could easily make performance *worse*
not better for many people, due to choosing incremental-sort plans
where they were really a loss.

Yeah. Essentially the patch could push the planner to pick a path that
has low startup cost (and very high total cost), assuming it'll only
need to read a small part of the input. But if the number of groups in
the input is low (perhaps just one huge group), that would be a regression.
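
To make that risk concrete, here is a hypothetical sketch of such a case
(the table, data, and query below are invented for illustration, not
taken from the patch or its tests):

    -- leading sort key 'a' is dominated by one huge group
    CREATE TABLE t (a int, b int);
    INSERT INTO t
    SELECT CASE WHEN i % 100 = 0 THEN i ELSE 0 END, i
    FROM generate_series(1, 1000000) i;
    CREATE INDEX ON t (a);
    ANALYZE t;

    -- An incremental sort over the index on (a) looks attractive for its
    -- low startup cost, but ~99% of the rows fall into the a = 0 group,
    -- so nearly the entire input must be sorted before the first row out.
    EXPLAIN SELECT * FROM t ORDER BY a, b LIMIT 10;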

If we were at the start of a development cycle and work were being
promised to be done later in the cycle to improve the planning aspect,
I'd be more charitable about it. But this isn't merely the end of a
cycle, it's the *last day*. Now is not the time to commit stuff that
needs, or even just might need, follow-on work.

+1 to that

FWIW I'm willing to spend some time on the patch for PG12, particularly
on the planner / costing part. The potential gains are too interesting.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#92Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#91)
Re: [HACKERS] [PATCH] Incremental sort

On Sat, Apr 7, 2018 at 11:57 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 04/07/2018 06:23 PM, Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step. Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

I think that improvement in the cost calculation of sort should be a
separate patch, not directly connected to this one. Postponing a patch
until other parts are ready, just to get the maximum improvement for the
postponed one, doesn't seem very good to me, especially when it offers
some improvement right now.

No, you misunderstand the point of my argument. Without a reasonably
reliable cost model, this patch could easily make performance *worse*
not better for many people, due to choosing incremental-sort plans
where they were really a loss.

Yeah. Essentially the patch could push the planner to pick a path that
has low startup cost (and very high total cost), assuming it'll only
need to read a small part of the input. But if the number of groups in
the input is low (perhaps just one huge group), that would be a regression.

Yes, I think the biggest risk here is too small a number of groups. More
precisely, the risk is groups that are too large, even while the total
number of groups might be large enough.
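
For what it's worth, the group-size information the planner has to work
with here is just the per-column statistics, which can be inspected
directly (an illustrative query, assuming the kind of table sketched
above with leading sort column a):

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 't' AND attname = 'a';

A large n_distinct can coexist with a single value covering most of the
table, which is exactly the skewed case where incremental sort loses.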

If we were at the start of a development cycle and work were being
promised to be done later in the cycle to improve the planning aspect,
I'd be more charitable about it. But this isn't merely the end of a
cycle, it's the *last day*. Now is not the time to commit stuff that
needs, or even just might need, follow-on work.

+1 to that

FWIW I'm willing to spend some time on the patch for PG12, particularly
on the planner / costing part. The potential gains are too interesting.

Thank you very much for your efforts in reviewing this patch.
I'm looking forward to working with you on this feature for PG12.

FWIW, I think that we're moving this patch in the right direction by
providing separate paths for incremental sort. It's much better than
deciding between full or incremental sort in-place. For sure, considering
extra paths might cause a planning time regression. But I think the
same statement is true about many other planning optimizations.
One thing we can do is to make enable_incrementalsort = off by
default. Then only users who understand the importance of incremental
sort will pay the extra planning time.
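
For reference, enable_incrementalsort behaves like any other planner
GUC and can be toggled per session, so cautious users could turn it on
only where it helps (a sketch; the query reuses the regression test
table tenk1):

    SET enable_incrementalsort = off;  -- planner stops considering incremental sort paths
    EXPLAIN (COSTS OFF)
    SELECT * FROM tenk1 ORDER BY four, ten LIMIT 1;
    SET enable_incrementalsort = on;   -- and considers them again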

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#93David Steele
david@pgmasters.net
In reply to: Alexander Korotkov (#92)
Re: [HACKERS] [PATCH] Incremental sort

On 4/9/18 11:56 AM, Alexander Korotkov wrote:

Thank you very much for your efforts in reviewing this patch.
I'm looking forward to working with you on this feature for PG12.

Since there's a new patch I have changed the status to Needs Review and
moved the entry to the next CF.

Still, it seems to me that discussion and new patches will be required
to address Tom's concerns.

Regards,
--
-David
david@pgmasters.net

#94Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: David Steele (#93)
Re: [HACKERS] [PATCH] Incremental sort

On Tue, Apr 10, 2018 at 4:15 PM, David Steele <david@pgmasters.net> wrote:

On 4/9/18 11:56 AM, Alexander Korotkov wrote:

Thank you very much for your efforts in reviewing this patch.
I'm looking forward to working with you on this feature for PG12.

Since there's a new patch I have changed the status to Needs Review and
moved the entry to the next CF.

Right, thank you.

Still, it seems to me that discussion and new patches will be required
to address Tom's concerns.

Sounds correct to me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#95James Coleman
jtc331@gmail.com
In reply to: Alexander Korotkov (#94)
1 attachment(s)
Re: Re: [HACKERS] [PATCH] Incremental sort

I've attached an updated copy of the patch that applies cleanly to
current master.

Attachments:

incremental-sort-26.patch (text/plain; charset=UTF-8)
commit 6428245702a40b3e3fa11bb64b7611cdd33a0778
Author: Alexander Korotkov <a.korotkov@postgrespro.ru>
Date:   Sat Apr 7 18:51:20 2018 +0300

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases when the
    input is already sorted by a prefix of the sort keys.  For example when a sort
    by (key1, key2 ... keyN) is requested, and the input is already sorted by
    (key1, key2 ... keyM), M < N, we can divide the input into groups where keys
    (key1, ... keyM) are equal, and only sort on the remaining columns.
    
    Incremental sort can give a huge benefit when a LIMIT clause is specified,
    since it then doesn't even have to read the whole input.  Another big
    benefit of incremental sort is that sorting data in small groups may avoid
    spilling to disk during the sort.  However, on small datasets which fit
    into memory, incremental sort may be slightly slower than a full sort.
    That is reflected in the costing.
    
    This patch implements very basic usage of incremental sort: it gets used
    only in create_ordered_paths(), although this kind of sort could help in
    many more cases, for instance in merge join.  But the latter would require
    much more extensive optimizer changes and is postponed to future releases.

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index e4d9469fdd..61775e6726 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1999,28 +1999,62 @@ SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2
  119
 (10 rows)
 
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Since the sort is not worth pushing down, the CROSS
+-- JOIN is also not pushed down, to transfer fewer tuples over the network.
 EXPLAIN (VERBOSE, COSTS OFF)
-SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
-                             QUERY PLAN                              
----------------------------------------------------------------------
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Limit
-   Output: t1.c1, t2.c1
+   Output: t1.c3, t2.c3
    ->  Sort
-         Output: t1.c1, t2.c1
-         Sort Key: t1.c1, t2.c1
+         Output: t1.c3, t2.c3
+         Sort Key: t1.c3, t2.c3
          ->  Nested Loop
-               Output: t1.c1, t2.c1
+               Output: t1.c3, t2.c3
                ->  Foreign Scan on public.ft1 t1
-                     Output: t1.c1
-                     Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                     Output: t1.c3
+                     Remote SQL: SELECT c3 FROM "S 1"."T 1"
                ->  Materialize
-                     Output: t2.c1
+                     Output: t2.c3
                      ->  Foreign Scan on public.ft2 t2
-                           Output: t2.c1
-                           Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
+                           Output: t2.c3
+                           Remote SQL: SELECT c3 FROM "S 1"."T 1"
 (15 rows)
 
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+  c3   |  c3   
+-------+-------
+ 00001 | 00101
+ 00001 | 00102
+ 00001 | 00103
+ 00001 | 00104
+ 00001 | 00105
+ 00001 | 00106
+ 00001 | 00107
+ 00001 | 00108
+ 00001 | 00109
+ 00001 | 00110
+(10 rows)
+
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on
+-- the local side, even though we don't know the LIMIT on the remote side.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
+                                                                            QUERY PLAN                                                                             
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit
+   Output: t1.c1, t2.c1
+   ->  Foreign Scan
+         Output: t1.c1, t2.c1
+         Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+         Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) ORDER BY r1."C 1" ASC NULLS LAST, r2."C 1" ASC NULLS LAST
+(6 rows)
+
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  c1 | c1  
 ----+-----
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index e1df952e7a..05c8df8da9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -514,7 +514,17 @@ SELECT t1.c1 FROM ft1 t1 WHERE EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c1)
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
--- CROSS JOIN, not pushed down
+-- CROSS JOIN, not pushed down.  For this query, the essential optimization is
+-- the top-N sort.  But it can't be performed on the remote side, because we
+-- never push LIMIT down.  Since the sort is not worth pushing down, the CROSS
+-- JOIN is also not pushed down, to transfer fewer tuples over the network.
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
+-- CROSS JOIN, pushed down.  Unlike the previous query, the remote side can
+-- return tuples in the requested order without a full sort, using an index
+-- scan plus an incremental sort.  This is much cheaper than a full sort on
+-- the local side, even though we don't know the LIMIT on the remote side.
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a189a8efc3..1145a9bdda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3717,6 +3717,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79f639d5e2..da9b030670 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,6 +81,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 				ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -94,7 +96,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -102,6 +104,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 				 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
@@ -1067,6 +1071,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1677,6 +1684,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2006,12 +2019,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2022,7 +2052,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2046,7 +2076,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2115,7 +2145,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2172,7 +2202,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2185,13 +2215,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2231,9 +2262,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2441,6 +2476,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 76d87eea49..c2f06da4e5 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 9e78421978..520aeefd83 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -253,6 +254,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -525,8 +530,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 52f1a96db5..fc3910502b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -32,6 +32,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -281,6 +282,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -494,6 +499,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -918,6 +927,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -978,6 +988,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1227,6 +1240,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a3fb4495d2..943ca65372 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -695,6 +701,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..0fbb63d4b2
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be as follows:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo.argnull[0] = false;
+		key->fcinfo.argnull[1] = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo.arg[0] = datumA;
+		key->fcinfo.arg[1] = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo.isnull = false;
+
+		result = FunctionCallInvoke(&key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo.isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples.  However, in the case
+ * of a bounded sort where the remaining bound is less than that, we use
+ * the remaining bound as the minimum group size instead.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		It fetches groups of tuples where the prefix sort columns are
+ *		equal and sorts them using tuplesort.  This avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it allows us to start returning tuples before fetching
+ *		the full dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures used to compare the already
+		 * sorted (presorted) columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We feed it groups of at
+		 * least minGroupSize tuples.  Thus, these groups don't
+		 * necessarily share an equal value of the first column.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Feed the next group of tuples, in which the presortedCols sort
+	 * values are all equal, to the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then start comparing the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group. Only then
+			 * can we do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group. We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (it does not serve as pivot, though).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(estate, &incrsortstate->ss.ps);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)));
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support random access to its sorted output,
+	 * so we always forget the previous sort results: we must re-read the
+	 * subplan and re-sort from scratch.
+	 *
+	 * We also reset bound_Done below, so a bounded sort starts over too.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 73f16c9aba..bdab33f5c4 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9287baaedc..9b117f7f05 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 03a91c3352..51d0e3008c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -894,12 +894,10 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
 	int			i;
 
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -921,6 +919,24 @@ _outSort(StringInfo str, const Sort *node)
 		appendStringInfo(str, " %s", booltostr(node->nullsFirst[i]));
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3806,6 +3822,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2812dc9646..ee730bd52d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2134,12 +2134,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2148,6 +2149,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2723,6 +2750,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 65a34a255d..b13f7a68ba 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3713,6 +3713,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 47729de896..f6d4bec556 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1611,9 +1612,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1640,39 +1641,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1711,7 +1696,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1722,7 +1707,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1733,12 +1718,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1749,8 +1734,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ *	  Determines and returns the cost of sorting a relation, including the
+ *	  cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines and returns the cost of sorting a relation incrementally,
+ *	  when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
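+ *
+ * Roughly (a sketch of what the code below computes):
+ *   startup cost = input startup cost + cost of sorting the first group,
+ *                  including that group's share of the input run cost;
+ *   total cost   = startup cost + sort and input costs of the remaining
+ *                  output groups, plus some per-tuple and per-group overhead.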
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, where we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate its expected group size by 50%.
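+	 * For example, with 10,000 input tuples estimated to fall into 100
+	 * groups, each group's sort is costed as if it contained
+	 * 1.5 * (10,000 / 100) = 150 tuples.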
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6d1cc3b8a0..3a3b2b6b14 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -327,6 +327,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
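+ *    For example, given keys1 = (a, b, c) and keys2 = (a, b, d), *n_common
+ *    is set to 2 and false is returned.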
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1587,19 +1632,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use an incremental sort.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99d0736029..34b2417c4c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -96,6 +96,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 					   int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 						 int flags);
@@ -245,6 +247,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						   Relids relids,
 						   const AttrNumber *reqColIdx,
@@ -259,6 +265,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 					   Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 						Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 						 AttrNumber *grpColIdx,
 						 Plan *lefttree);
@@ -458,6 +466,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1741,6 +1754,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan
+ *	  instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5000,17 +5039,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
 
-	cost_sort(&sort_path, root, NIL,
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5597,9 +5643,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5613,6 +5662,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5959,6 +6039,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6723,6 +6839,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 008492bad5..aa3d97b77d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4840,8 +4840,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The new paths we need to consider are an explicit full sort on the
+ * cheapest-total existing path, plus incremental sorts on any paths
+ * having some presorted leading pathkeys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4880,29 +4880,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 833a92f538..af0b720067 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -642,6 +642,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 83008d7661..313cad266f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2795,6 +2795,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9578..dfee78c43e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2593,6 +2593,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 71c2b4eff1..060790198a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -873,6 +873,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e433faad86..029c43b1d5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the overhead of
+ * allocation is as small as possible.  However, we don't consider array
+ * sizes less than 1024.
+ *
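+ * For example (illustrative figures, assuming ALLOCSET_SEPARATE_THRESHOLD is
+ * 8 kB and sizeof(SortTuple) is 24 bytes): 8192 / 24 + 1 = 342, so the
+ * macro evaluates to Max(1024, 342) = 1024.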
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								   one sort of a group, either in-memory
+								   or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for
+								   on-disk space, false when it's the value
+								   for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persist across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are released by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1064,7 +1091,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1107,7 +1134,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1251,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1317,111 @@ tuplesort_end(Tuplesortstate *state)
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Reset the per-sort memory context to release all working memory.  Note
+	 * that the Tuplesortstate struct itself is allocated in the main context
+	 * and so survives.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * A sort spills data to disk when the data does not fit in main memory.
+	 * This is why we consider space used on disk more important for tracking
+	 * resource usage than space used in memory.  Note that the amount of
+	 * space occupied by a set of tuples on disk might be less than the
+	 * amount occupied by the same tuples in memory, due to the more compact
+	 * on-disk representation.
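+	 *
+	 * For example, 8 MB used on disk would replace a previously recorded
+	 * 32 MB used in memory as the tracked maximum.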
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows us to avoid recreating the tuplesort (and so
+ *	save resources) when sorting multiple small batches.
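+ *
+ *	A typical caller pattern (an illustrative sketch): feed the tuples of
+ *	one batch, call tuplesort_performsort(), fetch the sorted tuples, then
+ *	call tuplesort_reset() before feeding the next batch.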
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2718,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2768,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3139,18 +3266,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 538e679cdf..88f18e3701 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1876,6 +1876,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call them "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
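+ *	 For example, when sorting by keys (a, b) with input already sorted by
+ *	 (a), there is a single PresortedKeyData entry, for column "a".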
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1904,6 +1918,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4fc2de7184..cf9e2e64f9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -127,6 +128,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0a797f0a05..81f1844574 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -757,6 +757,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index acb8814924..75569203f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1534,6 +1534,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d3269eae71..13b1c80632 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -61,6 +61,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -109,6 +110,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 895bf6959d..72da4cec08 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -170,6 +170,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 						   RelOptInfo *rel,
 						   Path *subpath,
 						   PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 				 RelOptInfo *rel,
 				 Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 50e180c554..3285a8055b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -189,6 +189,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d2e6754f04..4cad0d4fc2 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -192,8 +192,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate,
-					 bool randomAccess);
+					 int workMem, SortCoordinate coordinate, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel, int workMem,
 						SortCoordinate coordinate, bool randomAccess);
@@ -240,6 +239,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209ec2..b7b65fc62d 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a19ee08749..9dec75060d 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 00c324dd44..1e0d6f4b62 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -84,7 +84,7 @@ test: select_into select_distinct select_distinct_on select_implicit select_havi
 # ----------
 # Another group of parallel tests
 # ----------
-test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index
+test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password func_index incremental_sort
 
 # ----------
 # Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 39c3fa9c85..c43abdf1fc 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -90,6 +90,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..bd66228ada
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,17 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index c60d7d2342..1b05456316 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#96Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: James Coleman (#95)
Re: Re: [HACKERS] [PATCH] Incremental sort

Hi, James!

On Thu, May 31, 2018 at 11:10 PM James Coleman <jtc331@gmail.com> wrote:

I've attached an updated copy of the patch that applies cleanly to
current master.

Thank you for rebasing this patch. Next time you send a patch, please make
sure you've bumped its version, even if you made no changes besides a pure
rebase. Otherwise, it would be hard to distinguish patch versions, because
the patch files have exactly the same names.

I'd like to note that I'm going to provide a revised version of this patch
for the next commitfest. After some conversations at PGCon regarding this
patch, I got more confident that providing separate paths for incremental
sorts is right. In the next revision of this patch, incremental sort paths
will be provided in more cases.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#97James Coleman
jtc331@gmail.com
In reply to: Alexander Korotkov (#96)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

Here's a basic setup:

-- created_at has very high cardinality
create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

-- technically this data guarantees unique created_at values,
-- but there's no reason it couldn't be modified to have a few
-- random non-unique values to prove the point
insert into foo(owner_fk, created_at)
select i % 100, now() - (i::text || ' minutes')::interval
from generate_series(1, 1000000) t(i);

And here's the naive query to get the results we want:

select *
from foo
where owner_fk = 23
-- pk is only here to disambiguate/guarantee a stable sort
-- in the rare case that there are collisions in the other
-- sort field(s)
order by created_at desc, pk desc
limit 50;

On stock Postgres this ends up being pretty terrible for cases where the fk filter represents a large number of rows, because the planner generates a sort node under the limit node and therefore fetches all matches, sorts them, and then applies the limit. Here's the plan:

Limit  (cost=61386.12..61391.95 rows=50 width=16) (actual time=187.814..191.653 rows=50 loops=1)
  ->  Gather Merge  (cost=61386.12..70979.59 rows=82224 width=16) (actual time=187.813..191.647 rows=50 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=60386.10..60488.88 rows=41112 width=16) (actual time=185.639..185.642 rows=42 loops=3)
              Sort Key: created_at DESC, pk DESC
              Sort Method: top-N heapsort  Memory: 27kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
              ->  Parallel Bitmap Heap Scan on foo  (cost=3345.24..59020.38 rows=41112 width=16) (actual time=25.150..181.804 rows=33333 loops=3)
                    Recheck Cond: (owner_fk = 23)
                    Heap Blocks: exact=18014
                    ->  Bitmap Index Scan on idx_foo_on_owner_and_created_at  (cost=0.00..3320.57 rows=98668 width=0) (actual time=16.992..16.992 rows=100000 loops=1)
                          Index Cond: (owner_fk = 23)
Planning Time: 0.384 ms
Execution Time: 191.704 ms

I have a recursive CTE that implements the algorithm (a sketch of an equivalent workaround follows below):
- Find the first n+1 results.
- If result n+1’s created_at value differs from result n’s, return the first n values.
- If they are equal, gather more results until a new created_at value is encountered.
- Sort all results by created_at and a tie-breaker (e.g., pk) and return the first n values.
But nobody wants to use/write that normally (it's quite complex).
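
For illustration, the same effect can be had without recursion. Here's a rough, untested sketch against the foo table from the setup above (hypothetical, not the exact CTE described; it just shows the shape of the workaround: widen the scan past ties on created_at, then tie-break):

-- grab the top 50 by created_at alone (this can use the existing
-- (owner_fk, created_at) index), then widen the candidate set to every
-- row tying the boundary value so the tie-break sort sees them all
with candidates as (
  select created_at
  from foo
  where owner_fk = 23
  order by created_at desc
  limit 50
)
select *
from foo
where owner_fk = 23
  and created_at >= (select min(created_at) from candidates)
order by created_at desc, pk desc
limit 50;

Every query of this class needs its own hand-rolled variant of that, which is exactly the complexity this patch avoids.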

This patch solves the problem presented; here's the plan:

Limit  (cost=2.70..2.76 rows=50 width=16) (actual time=0.233..0.367 rows=50 loops=1)
  ->  Incremental Sort  (cost=2.70..111.72 rows=98668 width=16) (actual time=0.232..0.362 rows=50 loops=1)
        Sort Key: created_at DESC, pk DESC
        Presorted Key: created_at
        Sort Method: quicksort  Memory: 26kB
        Sort Groups: 2
        ->  Index Scan Backward using idx_foo_on_owner_and_created_at on foo  (cost=0.56..210640.79 rows=98668 width=16) (actual time=0.054..0.299 rows=65 loops=1)
              Index Cond: (owner_fk = 23)
Planning Time: 0.428 ms
Execution Time: 0.393 ms

While make installcheck-world fails, the only failure appears to be a plan output change in src/test/isolation/expected/drop-index-concurrently-1.out that just needs to be updated (incremental sort is now used in that plan); I don't see any functional breakage.

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#97)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

Here's a basic setup:

-- created_at has very high cardinality
create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

-- technically this data guarantees unique created_at values,
-- but there's no reason it couldn't be modified to have a few
-- random non-unique values to prove the point
insert into foo(owner_fk, created_at)
select i % 100, now() - (i::text || ' minutes')::interval
from generate_series(1, 1000000) t(i);

And here's the naive query to get the results we want:

select *
from foo
where owner_fk = 23
-- pk is only here to disambiguate/guarantee a stable sort
-- in the rare case that there are collisions in the other
-- sort field(s)
order by created_at desc, pk desc
limit 50;

If you're concerned about the performance of this case, why don't you make
an index that actually matches the query?

regression=# create index on foo (owner_fk, created_at, pk);
CREATE INDEX
regression=# explain analyze select * from foo where owner_fk = 23 order by created_at desc, pk desc limit 50;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..110.92 rows=50 width=16) (actual time=0.151..0.280 rows=50 loops=1)
   ->  Index Only Scan Backward using foo_owner_fk_created_at_pk_idx on foo  (cost=0.42..20110.94 rows=9100 width=16) (actual time=0.146..0.255 rows=50 loops=1)
         Index Cond: (owner_fk = 23)
         Heap Fetches: 50
 Planning Time: 0.290 ms
 Execution Time: 0.361 ms
(6 rows)

There may be use-cases for Alexander's patch, but I don't find this
one to be terribly convincing.

regards, tom lane

#99James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#98)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

I disagree, because it's not ideal to have to append pk to essentially every
index in the system just to get the ability to tie-break when there's
actually a very low likelihood of ties anyway.

A similar use case is trying to batch through a result set, in which case
you need a stable sort as well.

The benefit is retaining the generality of indexes (and saving space in
them, etc.) while still allowing them to be used for more specific sorts. Any
time you paginate or batch this way you benefit from this, which in many
applications applies to a very high percentage of indexes.
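
To make the batching case concrete, here's a hypothetical keyset-pagination sketch (reusing the foo table from upthread; :last_created_at and :last_pk are taken from the final row of the previous batch):

select *
from foo
where owner_fk = 23
  and (created_at, pk) < (:last_created_at, :last_pk)
order by created_at desc, pk desc
limit 50;

Without the pk tie-breaker, batches can skip or repeat rows whenever created_at collides; with it, stock Postgres wants the three-column index, while incremental sort can serve the same ordering from the existing two-column index.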

#100Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#99)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 09/06/2018 08:04 PM, James Coleman wrote:

I disagree, because it's not ideal to have to append pk to essentially
every index in the system just to get the ability to tie-break when
there's actually a very low likelihood of ties anyway.

A similar use case is trying to batch through a result set, in which
case you need a stable sort as well.

The benefit is retaining the generality of indexes (and saving space in
them, etc.) while still allowing them to be used for more specific sorts. Any
time you paginate or batch this way you benefit from this, which in many
applications applies to a very high percentage of indexes.

I'm 100% with this.

I see incremental sort as a way to run queries with fewer indexes that
are less query-specific, while still benefiting from them. Which means
lower overhead when writing data, lower disk space usage, and so on.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#101Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#96)
Re: [HACKERS] [PATCH] Incremental sort

Hi Alexander,

On 06/01/2018 04:22 PM, Alexander Korotkov wrote:

Hi, James!

On Thu, May 31, 2018 at 11:10 PM James Coleman <jtc331@gmail.com> wrote:

I've attached an updated copy of the patch that applies cleanly to
current master.

Thank you for rebasing this patch. Next time you send a patch,
please make sure you've bumped its version, even if you made no
changes besides a pure rebase. Otherwise, it would be hard to
distinguish patch versions, because the patch files have exactly
the same names.

I'd like to note that I'm going to provide a revised version of this
patch for the next commitfest. After some conversations at PGCon
regarding this patch, I got more confident that providing separate
paths for incremental sorts is right. In the next revision of this
patch, incremental sort paths will be provided in more cases.

Do you plan to submit an updated patch version for the November CF? It
would be good to look at the costing model soon, otherwise it might end
up missing PG12, and that would be unfortunate.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#102Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#100)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 20, 2017 at 10:34 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Please, find rebased patch in the attachment.

It's been a while since this patch was posted. There is already a fair amount of
feedback in this thread, and the patch itself unfortunately has some conflicts
with the current master. Alexander, do you have any plans for this feature?
For now I'll probably mark it as "Returned with feedback".

#103Shaun Thomas
shaun.thomas@2ndquadrant.com
In reply to: Dmitry Dolgov (#102)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Another ping on this Incremental Sort patch.

Alexander, you'd noted that you would try to get it into subsequent
Commit Fests with improvements you've been considering, but I don't
see it in anything but 2018-11. Have you abandoned this as a
maintainer? If so, it would be nice to know so someone else can pick
it up.

--
Shaun M Thomas - 2ndQuadrant
PostgreSQL Training, Services and Support
shaun.thomas@2ndquadrant.com | www.2ndQuadrant.com

#104James Coleman
jtc331@gmail.com
In reply to: Shaun Thomas (#103)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2018-06-01 14:22:26, Alexander Korotkov wrote:

I'd like to note that I'm going to provide a revised version of this patch
for the next commitfest. After some conversations at PGCon regarding this
patch, I got more confident that providing separate paths for incremental
sorts is right. In the next revision of this patch, incremental sort paths
will be provided in more cases.

Alexander,

I'm currently rebasing the patch, and if I get some time I'm going to
look into working on it.

Would you be able to provide a bit more detail on the changes you were
hoping to make after conversations you'd had with others? I'm hoping
for any pointers/context you have from those conversations as to what
you felt was necessary to get this change committed.

Thanks,
James Coleman

#105James Coleman
jtc331@gmail.com
In reply to: James Coleman (#104)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

I've rebased the patch on master and confirmed make check-world passes.

Attachments:

incremental-sort-27.patchapplication/octet-stream; name=incremental-sort-27.patchDownload
From 8007bc1eee58b2e20ad1b766f5524c02f07b9483 Mon Sep 17 00:00:00 2001
From: jcoleman <james.coleman@getbraintree.com>
Date: Fri, 31 May 2019 14:40:17 +0000
Subject: [PATCH] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases when the
input is already sorted by a prefix of the sort keys.  For example when a sort
by (key1, key2 ... keyN) is requested, and the input is already sorted by
(key1, key2 ... keyM), M < N, we can divide the input into groups where keys
(key1, ... keyM) are equal, and only sort on the remaining columns.

Incremental sort can give a huge benefit when a LIMIT clause is specified,
since then it doesn't even have to read the whole input.  Another huge
benefit of incremental sort is that sorting data in small groups may avoid
spilling to disk during the sort.  However, on small datasets which fit into
memory incremental sort may be slightly slower than full sort.  That is
reflected in the costing.

This patch implements very basic usage of incremental sort: it gets used
only in create_ordered_paths(), while incremental sort can help in many more
use cases, for instance in merge join.  But the latter would require many
more changes in the optimizer and is postponed to future releases.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/commands/explain.c                | 138 +++-
 src/backend/executor/Makefile                 |   4 +-
 src/backend/executor/execAmi.c                |  13 +
 src/backend/executor/execParallel.c           |  18 +
 src/backend/executor/execProcnode.c           |  10 +
 src/backend/executor/nodeIncrementalSort.c    | 682 ++++++++++++++++++
 src/backend/executor/nodeSort.c               |   3 +-
 src/backend/nodes/copyfuncs.c                 |  49 +-
 src/backend/nodes/outfuncs.c                  |  25 +-
 src/backend/nodes/readfuncs.c                 |  37 +-
 src/backend/optimizer/path/allpaths.c         |   4 +
 src/backend/optimizer/path/costsize.c         | 214 +++++-
 src/backend/optimizer/path/pathkeys.c         |  61 +-
 src/backend/optimizer/plan/createplan.c       | 129 +++-
 src/backend/optimizer/plan/planner.c          |  71 +-
 src/backend/optimizer/plan/setrefs.c          |   1 +
 src/backend/optimizer/plan/subselect.c        |   1 +
 src/backend/optimizer/util/pathnode.c         |  51 ++
 src/backend/utils/misc/guc.c                  |   9 +
 src/backend/utils/sort/tuplesort.c            | 188 ++++-
 src/include/executor/nodeIncrementalSort.h    |  30 +
 src/include/nodes/execnodes.h                 |  54 ++
 src/include/nodes/nodes.h                     |   3 +
 src/include/nodes/pathnodes.h                 |   9 +
 src/include/nodes/plannodes.h                 |  11 +
 src/include/optimizer/cost.h                  |  10 +
 src/include/optimizer/pathnode.h              |   6 +
 src/include/optimizer/paths.h                 |   2 +
 src/include/utils/tuplesort.h                 |   2 +
 .../expected/drop-index-concurrently-1.out    |   5 +-
 .../regress/expected/incremental_sort.out     |  45 ++
 .../regress/expected/partition_aggregate.out  |   2 +
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/parallel_schedule            |   2 +-
 src/test/regress/serial_schedule              |   1 +
 src/test/regress/sql/incremental_sort.sql     |  17 +
 src/test/regress/sql/partition_aggregate.sql  |   2 +
 38 files changed, 1808 insertions(+), 118 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9ba845b53a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4368,6 +4368,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..b3b519f927 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 1f18e5d3a2..8680e7d911 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -559,8 +564,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 0ab9a9939c..5810ccd329 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c227282975..9e3e875d32 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -694,6 +700,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..6e9cd18100
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm splits the input into the following
+ *		groups, which have equal X, and then sorts them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them all together, we get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * We do assume the input is sorted by keys (0, ... n), which means
+	 * the tail keys are more likely to change. So we do the comparison
+	 * from the end, to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient.  In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples.  However, in the case
+ * of bounded sort where the bound is less than DEFAULT_MIN_GROUP_SIZE, we
+ * use the remaining bound as the minimal group size instead.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs incremental sort.  It
+ *		fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it allows us to start returning tuples before fetching the
+ *		full dataset from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return next tuple from the current sorted group set if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures used to compare the presorted
+		 * columns (see preparePresortedCols() and isCurrentGroup()).
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We feed tuplesort groups of
+		 * at least minGroupSize tuples, so such a group doesn't necessarily
+		 * have equal values of the presorted columns.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Put the next group of tuples, where the presortedCols sort values
+	 * are equal, into the tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while presorted cols are the same as in the pivot
+			 * tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group. Only then
+			 * we can do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group.  We
+			 * need to keep the last tuple, so we copy it into the pivot slot
+			 * (there it's only leftover input for the next group, not a pivot).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort keeps only the current group in the tuplesort and
+	 * doesn't support randomAccess, so we can't simply rewind and rescan
+	 * the sorted output.  Instead we always forget the previous sort
+	 * results, re-read the subplan, and re-sort from scratch on the next
+	 * ExecProcNode call.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..de27b06e15 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -921,6 +921,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 237598e110..83a063960f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -830,10 +830,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -843,6 +841,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..9e0d42322c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2114,12 +2114,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2128,6 +2129,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2761,6 +2788,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..3efc807164 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3884,6 +3884,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..7f820e7351 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
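+	/* The input run cost is assumed to be spread evenly across the groups. */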
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and we rely on quite rough assumptions here.
+	 * Thus, we are pessimistic about incremental sort performance and
+	 * inflate the expected group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
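+	/*
+	 * For example, with output_groups = 10 the plan pays for ten group
+	 * sorts in total: the first is counted in startup_cost, the remaining
+	 * nine here, along with nine groups' worth of input run cost.
+	 */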
+
+	/*
+	 * Incremental sort adds some overhead of its own.  First, it has to
+	 * detect the sort groups, costing roughly one extra copy and comparison
+	 * per tuple.  Second, it has to reset the tuplesort context for every
+	 * group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..454c61e1d8 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -332,6 +332,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
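+	/* keys1 is contained in keys2 iff we ran off the end of keys1 */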
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1791,19 +1836,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys are useful for ordering because
+	 * incremental sort can exploit them.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..a02b6ee3dd 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -241,6 +243,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -255,6 +261,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -457,6 +465,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1988,6 +2001,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5050,17 +5089,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5633,9 +5679,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5649,6 +5698,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5995,6 +6075,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an IncrementalSort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6729,6 +6845,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..38905501d9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4917,8 +4917,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort on the
+ * cheapest-total existing path, plus incremental sorts on any paths that
+ * are presorted by a prefix of the required pathkeys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4957,29 +4957,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/*
+				 * Also consider incremental sort.  Unlike the full sort
+				 * above, it is tried for every input path that is
+				 * presorted by at least one key, not just the cheapest one.
+				 */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index dc11f098e0..878cb6b934 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -648,6 +648,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index efd0fbc21c..41a5e18195 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..91066b238c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2777,6 +2777,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
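+	/* Remember how many leading pathkeys the input is already sorted by */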
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..63ce07a00c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -942,6 +942,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b8e67899e..5cd35a0221 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * array don't exceed ALLOCSET_SEPARATE_THRESHOLD and overhead of allocation
+ * be possible less.  However, we don't cosider array sizes less than 1024
+ *
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied across all
+								   sort batches, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+								   space, false when it's the value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persist across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the main context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,16 +1250,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1293,7 +1316,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
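+
+	/*
+	 * Deleting maincontext also frees the Tuplesortstate struct itself, as
+	 * it was allocated there.
+	 */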
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't fit in main memory.
+	 * Hence we treat space used on disk as more significant for tracking
+	 * resource usage than space used in memory.  Note that the same set of
+	 * tuples may occupy less space on disk than in memory, owing to the
+	 * more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Release all the data stored in the tuplesort, but
+ *	keep the meta-information.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
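+	/*
+	 * The memtuples array may have been shrunk during the previous batch
+	 * (e.g. by mergeruns); if so, restore it to the initial size so the
+	 * next batch starts from the standard allocation.
+	 */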
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2590,8 +2717,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
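+	/* Reset rather than delete, so the context can be reused after tuplesort_reset */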
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2641,7 +2767,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3265,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 64122bc1e3..eccbb020b8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1922,6 +1922,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents the information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo			flinfo;		/* comparison function info */
+	FunctionCallInfo	fcinfo;		/* comparison function call info */
+	OffsetNumber		attno;		/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1950,6 +1964,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..0500a3199f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..ebfee257f3 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1615,6 +1615,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..f9baee6495 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -762,6 +762,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..7c05d6cf71 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fe5339ea2e 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..e7a40cec3f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -183,6 +183,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 4521de18e1..ac3377dccc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index f23fe8d870..1226a063ef 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ca200eb599..e7e80a105c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -89,6 +89,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b64bed7e60
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,17 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.20.1

#106Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#105)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi James,

On Fri, May 31, 2019 at 03:51:57PM -0400, James Coleman wrote:

I've rebased the patch on master and confirmed make check-world passes.

Thanks for the rebase! I think the patch is in pretty good shape - I'm
sure we'll find ways to make it more efficient etc. but IMO that's fine
and should not prevent getting it committed.

The planning/costing logic may need further discussion and improvements,
though. IIRC this was the main reason why the patch missed PG11, because
at that point it simply used incremental sort whenever the input was
already presorted by a pathkey prefix, but that may be slower than regular
sort in some cases (unexpectedly large groups, etc.).

I see the current patch partially improves this by creating both paths
(full and incremental sort). That's good, because it means the decision
is cost-based (as it should be). The question, however, is how accurate
the costing model is - and per discussion in the thread, it may need
some improvements to handle skewed distributions better.

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each
column, and an index on the first one.

CREATE table t (a INT, b INT);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.

Now, let's say the first group is much larger:

INSERT INTO t SELECT 0, 100*random()
FROM generate_series(1,900000) s(i);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,100000);

That is, the first group is roughly 90% of the data, while the number of
groups stays the same, so we need to sort 90% of the data just to produce
the first tuple. Yet the average group size is unchanged, so the current
cost model is oblivious to this.
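
To see the skew directly, a quick check (assuming the two INSERTs above
went into an otherwise empty table):

SELECT a, count(*) FROM t GROUP BY a ORDER BY count(*) DESC LIMIT 3;

-- a = 0 accounts for roughly 900k rows, while every other group has
-- only about 1k rows.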

But I think we can improve this by looking at the MCV lists (either
per-column or multi-column) and see if some groups are much larger, and
consider that when computing the costs.

In particular, I think we should estimate the size of the first group,
because that's important for startup cost - we need to process the whole
first group before producing the first tuple, and that matters for LIMIT
queries etc.

For example, let's say the first (already sorted) column has a MCV. Then
we can see how large the first group (by value, not frequency) is, and
use that instead of the average group size. E.g. in the above example we'd
know the first group is ~90%.
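
For illustration, a sketch of where that number could come from (using
the example table above; assumes the table has been ANALYZEd so the
value made it into the MCV list):

ANALYZE t;

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 't' AND attname = 'a';

-- most_common_vals should include 0 with a frequency near 0.9; the
-- frequency of the lowest value (in sort order) present in the MCV
-- list is exactly the first-group estimate we'd want for startup cost.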

And we could do the same for multiple columns, either by looking at
multi-column MCV lists (if there's one), or by using minimum from each
per-column MCV lists.

Of course, these are only the groups that made it to the MCV list, and
there may be other (smaller) groups before this large one. For example
there could be a group with "-1" value and a single row.

For a moment I thought we could/should look at the histogram, because that
could tell us if there are groups "before" the first MCV one, but I don't
think we should do that, for two reasons. Firstly, rare values may not get
to the histogram anyway, which makes this rather unreliable and might
introduce sudden plan changes, because the cost would vary wildly
depending on whether we happened to sample the rare row or not. And
secondly, the rare row may be easily filtered out by a WHERE condition or
something, at which point we'll have to deal with the large group anyway.

So I think we should look at the MCV list, and use that information when
computing the startup/total cost. I think using the first/largest group to
compute the startup cost, and the average group size for total cost would
do the trick.
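
To put rough numbers on it (a back-of-the-envelope sketch using a
generic N * log2(N) comparison count, not the exact cost_sort()
formula): with 1M rows in 100 uniform groups the first group has ~10k
rows, so the startup sort performs about 10000 * log2(10000), i.e.
~130k comparisons. In the skewed example the first group has ~900k
rows, i.e. about 900000 * log2(900000), or ~17.8M comparisons - more
than 100x the uniform estimate. That is the gap an MCV-derived
first-group size would let the startup cost capture.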

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is, and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort while full sort would
be faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#107James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#106)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Thanks for the rebase! I think the patch is in pretty good shape - I'm
sure we'll find ways to make it more efficient etc. but IMO that's fine
and should not prevent getting it committed.

Thank you for the in-depth review!

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each
column, and an index on the first one.

CREATE table t (a INT, b INT);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.

Now, let's say the first group is much larger:

INSERT INTO t SELECT 0, 100*random()
FROM generate_series(1,900000) s(i);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,100000);

That is, the first group is roughly 90% of the data, while the number of
groups stays the same, so we need to sort 90% of the data just to produce
the first tuple. Yet the average group size is unchanged, so the current
cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values because
(from code comments) "Sorting many small groups with tuplesort is
inefficient". While this seems a reasonable optimization, I think it's
possible this line of thinking steered us away from an optimization in
the opposite direction. Perhaps we should always track whether or not
all prefix tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to have
tuplesort only care about the non-prefix columns (where currently it
has to sort on all pathkey columns, even though for a large group the
prefix columns are guaranteed to be equal).

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.

One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
case where you have a very small group and then a very large batch,
we'd lose the ability to optimize in the above way. That makes me
wonder if we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway when there
are large numbers of input tuples but a relatively small limit.

Thoughts?

So I think we should look at the MCV list, and use that information when
computing the startup/total cost. I think using the first/largest group to
compute the startup cost, and the average group size for total cost would
do the trick.

I think this sounds very reasonable.

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is, and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort while full sort would
be faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

Unrelated to all of the above: if I read the patch properly it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

James Coleman

#108Rafia Sabih
rafia.pghackers@gmail.com
In reply to: James Coleman (#107)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:

On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each
column, and an index on the first one.

CREATE table t (a INT, b INT);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.

Now, let's say the first group is much larger:

INSERT INTO t SELECT 0, 100*random()
FROM generate_series(1,900000) s(i);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,100000);

That is, the first group is roughly 90% of the data, while the number of
groups stays the same, so we need to sort 90% of the data just to produce
the first tuple. Yet the average group size is unchanged, so the current
cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values because
(from code comments) "Sorting many small groups with tuplesort is
inefficient". While this seems a reasonable optimization, I think it's
possible this line of thinking steered us away from an optimization in
the opposite direction. Perhaps we should always track whether or not
all prefix tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to have
tuplesort only care about the non-prefix columns (where currently it
has to sort on all pathkey columns, even though for a large group the
prefix columns are guaranteed to be equal).

+1 for passing only the non-prefix columns to tuplesort.

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.

One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
case where you have a very small group and then a very large batch,
we'd lose the ability to optimize in the above way. That makes me
wonder if we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway when there
are large numbers of input tuples but a relatively small limit.

What about using the knowledge of the MCVs here: if we know the next
value is in the MCV list, then take the overhead of sorting this small
group alone, and then leverage the optimization for the larger group by
passing only the non-prefix columns.

Thoughts?

So I think we should look at the MCV list, and use that information when
computing the startup/total cost. I think using the first/largest group to
compute the startup cost, and the average group size for total cost would
do the trick.

+1

I think this sounds very reasonable.

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is, and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort while full sort would
be faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

What about having some simple mechanism for this, like: if we encounter
a group with more tuples than estimated, simply switch to normal sort
for the remaining tuples, as the estimates do not hold true anyway. At
least this avoids the issue of incremental sort regressing too badly
compared to normal sort. I mean having something like this: populate the
tuplesortstate and keep counting the number of tuples in a group; if
they are within the budget, call tuplesort_performsort, otherwise put
all the remaining tuples into the tuplesort and then call
tuplesort_performsort. We may have an additional field in
IncrementalSortState to save the estimated size of each group. I am
assuming that we use MCV lists to better approximate the group sizes,
as suggested above by Tomas.

Unrelated to all of the above: if I read the patch properly it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

Regarding this, I came across this,
/*
* Incremental sort can't be used with either EXEC_FLAG_REWIND,
* EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
* bucket in tuplesortstate.
*/
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when would incremental sort be
useful for a backward scan?

On a different note, I can't stop imagining this operator along lines
similar to parallel append, wherein multiple workers could sort the
different groups independently at the same time.

--
Regards,
Rafia Sabih

#109James Coleman
jtc331@gmail.com
In reply to: Rafia Sabih (#108)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

What about having some simple mechanism for this, like: if we encounter
a group with more tuples than estimated, simply switch to normal sort
for the remaining tuples, as the estimates do not hold true anyway. At
least this avoids the issue of incremental sort regressing too badly
compared to normal sort. I mean having something like this: populate the
tuplesortstate and keep counting the number of tuples in a group; if
they are within the budget, call tuplesort_performsort, otherwise put
all the remaining tuples into the tuplesort and then call
tuplesort_performsort. We may have an additional field in
IncrementalSortState to save the estimated size of each group. I am
assuming that we use MCV lists to better approximate the group sizes,
as suggested above by Tomas.

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.
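
For reference, the kind of measurement I have in mind is simply timing
the same query with the new GUC flipped (a sketch, using the skewed
table from upthread and assuming an index on (a) so the incremental
sort path is available):

\timing on

SET enable_incrementalsort = on;
SELECT * FROM t ORDER BY a, b LIMIT 1;

SET enable_incrementalsort = off;
SELECT * FROM t ORDER BY a, b LIMIT 1;

Repeating that over the uniform and skewed data sets (and various LIMIT
values) should give the concrete numbers for points 1 and 2.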

Unrelated to all of the above: if I read the patch properly it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

Regarding this, I came across this,
/*
* Incremental sort can't be used with either EXEC_FLAG_REWIND,
* EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
* bucket in tuplesortstate.
*/
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when would incremental sort be
useful for a backward scan?

For some reason I was thinking we'd need it to support backwards scans
to be able to handle DESC sort on the index, but I've tested and
confirmed that already works. I suppose that's because the index scan
provides that ordering and the sort node doesn't need to reverse the
direction of what's provided to it. That's not particularly obvious to
someone newer to the codebase; I'm not sure if that's documented
anywhere.
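
Something like this reproduces what I tested (a sketch - the table
follows the upthread example, and the index is my own addition):

CREATE INDEX t_a_idx ON t (a);

EXPLAIN (COSTS OFF)
SELECT * FROM t ORDER BY a DESC, b DESC LIMIT 1;

-- The plan should show "Index Scan Backward using t_a_idx" feeding the
-- Incremental Sort, i.e. the input already arrives in descending order
-- on a, so the sort node itself never has to scan backwards.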

On a different note, I can't stop imagining this operator along lines
similar to parallel append, wherein multiple workers could sort the
different groups independently at the same time.

That is an interesting idea. I suppose it'd be particularly valuable
if somehow there were a node generating each batch in parallel
already, though I'm not sure at first thought what kind of query or
node would result in that. I also wonder if (assuming that weren't the
case) it would be much of an improvement since a single thread would
have to generate each batch anyway; I'm not sure if the overhead of
farming each batch out to a worker would actually be useful or if the
real blocker is the base scan.

At the very least it's an interesting idea.

---

I've been writing down notes here, and I realized that my test case
from far upthread is actually a useful setup to see how much overhead
is involved in sorting each batch individually, since it sets up data
with each batch only containing 1 tuple. That particular case is one
we could easily optimize anyway in the code and skip sorting
altogether -- might be a useful enhancement.
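
To reproduce that shape of data, a sketch along the same lines (not
the exact upthread case) is to make the presorted key unique, so every
group contains a single tuple:

CREATE TABLE t_single (a int, b int);

INSERT INTO t_single
SELECT i, (random() * 100)::int FROM generate_series(1, 1000000) s(i);

CREATE INDEX ON t_single (a);

An ORDER BY a, b query over that index then yields one tuple per
incremental-sort group, so the per-group tuplesort overhead dominates.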

I hope to do some more testing and then report back with the results.

James Coleman

#110Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#109)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

What about having some simple mechanism for this, like: if we encounter
a group with more tuples than estimated, simply switch to normal sort
for the remaining tuples, as the estimates do not hold true anyway. At
least this avoids the issue of incremental sort regressing too badly
compared to normal sort. I mean having something like this: populate the
tuplesortstate and keep counting the number of tuples in a group; if
they are within the budget, call tuplesort_performsort, otherwise put
all the remaining tuples into the tuplesort and then call
tuplesort_performsort. We may have an additional field in
IncrementalSortState to save the estimated size of each group. I am
assuming that we use MCV lists to better approximate the group sizes,
as suggested above by Tomas.

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

+1 to that

Unrelated to all of the above: if I read the patch properly it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

Regarding this, I came across this,
/*
* Incremental sort can't be used with either EXEC_FLAG_REWIND,
* EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
* bucket in tuplesortstate.
*/
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when would incremental sort be
useful for a backward scan?

For some reason I was thinking we'd need it to support backwards scans
to be able to handle DESC sort on the index, but I've tested and
confirmed that already works. I suppose that's because the index scan
provides that ordering and the sort node doesn't need to reverse the
direction of what's provided to it. That's not particularly obvious to
someone newer to the codebase; I'm not sure if that's documented
anywhere.

Yeah, backward scans are not about ASC/DESC, they're about being able to
walk back through the result. And we can't do that with incremental sort
without materialization.

On a different note, I can't stop imagining this operator along lines
similar to parallel append, wherein multiple workers could sort the
different groups independently at the same time.

That is an interesting idea. I suppose it'd be particularly valuable
if somehow there were a node generating each batch in parallel
already, though I'm not sure at first thought what kind of query or
node would result in that. I also wonder if (assuming that weren't the
case) it would be much of an improvement since a single thread would
have to generate each batch anyway; I'm not sure if the overhead of
farming each batch out to a worker would actually be useful or if the
real blocker is the base scan.

At the very least it's an interesting idea.

I kinda doubt that'd be very valuable. Or more precisely, we kinda already
have that capability because we can do things like this:

-> Gather Merge
-> Sort
-> ... scan ...

so I imagine we'd just do an Incremental Sort here. Granted, it's not
distributed by prefix groups (I assume that's what you mean by batches
here), but I don't think that's a big problem.

---

I've been writing down notes here, and I realized that my test case
from far upthread is actually a useful setup to see how much overhead
is involved in sorting each batch individually, since it sets up data
with each batch only containing 1 tuple. That particular case is one
we could easily optimize anyway in the code and skip sorting
altogether -- might be a useful enhancement.

I hope to do some more testing and then report back with the results.

James Coleman

OK.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#111Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rafia Sabih (#108)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Jun 05, 2019 at 06:14:14PM +0200, Rafia Sabih wrote:

On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:

On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each
column, and an index on the first one.

CREATE table t (a INT, b INT);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.

Now, let's say the first group is much larger:

INSERT INTO t SELECT 0, 100*random()
FROM generate_series(1,900000) s(i);

INSERT INTO t SELECT 100*random(), 100*random()
FROM generate_series(1,100000);

That is, the first group is roughly 90% of the data, while the number of
groups stays the same, so we need to sort 90% of the data just to produce
the first tuple. Yet the average group size is unchanged, so the current
cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values because
(from code comments) "Sorting many small groups with tuplesort is
inefficient". While this seems a reasonable optimization, I think it's
possible this line of thinking steered us away from an optimization in
the opposite direction. Perhaps we should always track whether or not
all prefix tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to have
tuplesort only care about the non-prefix columns (where currently it
has to sort on all pathkey columns, even though for a large group the
prefix columns are guaranteed to be equal).

+1 for passing only the non-prefix columns to tuplesort.

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.

One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
case where you have a very small group and then a very large batch,
we'd lose the ability to optimize in the above way. That makes me
wonder if we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway when there
are large numbers of input tuples but a relatively small limit.

What about using the knowledge of the MCVs here: if we know the next
value is in the MCV list, then take the overhead of sorting this small
group alone, and then leverage the optimization for the larger group by
passing only the non-prefix columns.

Not sure. It very much depends on how expensive the comparisons are for
that particular data type. If the comparisons are cheap, then I'm not sure
it matters very much whether the group is small or large. For expensive
comparisons it may not be a win either, because we need to search the MCV
lists whenever the group changes.

I guess we'll need to make some benchmarks and see if it's a win or not.

Thoughts?

So I think we should look at the MCV list, and use that information when
computing the startup/total cost. I think using the first/largest group to
compute the startup cost, and the average group size for total cost would
do the trick.

+1

I think this sounds very reasonable.

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is, and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort while full sort would
be faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

I don't think we have nodes where we'd switch to an entirely different
algorithm - say from hash-join to nested-loop join (as is often proposed
as a solution for excessive memory consumption). That's obviously a
very complex thing to implement.

But I don't think that's the type of fallback we'd need here - IMO it's
more similar to switching from in-memory to on-disk sort. Essentially,
we'd need to disable the extra logic (detecting the prefix grouping), and
just stash all the remaining tuples into tuplesort and do regular sort.

Of course, I haven't actually implemented this, so maybe it's trickier.

What about having some simple mechanism for this, like: if we encounter
a group with more tuples than estimated, simply switch to normal sort
for the remaining tuples, as the estimates do not hold true anyway. At
least this avoids the issue of incremental sort regressing too badly
compared to normal sort. I mean having something like this: populate the
tuplesortstate and keep counting the number of tuples in a group; if
they are within the budget, call tuplesort_performsort, otherwise put
all the remaining tuples into the tuplesort and then call
tuplesort_performsort. We may have an additional field in
IncrementalSortState to save the estimated size of each group. I am
assuming that we use MCV lists to better approximate the group sizes,
as suggested above by Tomas.

Maybe. I suggest we try to implement the simplest solution first, trying
to do as much as possible during planning, and then try to be smart at
execution time only if necessary.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#112James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#111)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Apr 7, 2018 at 4:56 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

I agree with that. For bounded sort, attached patch now selects minimal
group
size as Min(DEFAULT_MIN_GROUP_SIZE, bound). That should improve
"LIMIT small_number" case.

As I was working on some benchmarking I noticed that incremental sort
never seemed to switch into the top-n heapsort mode, which meant for
very large groups it significantly underperformed a regular sort since
it would have to spill to disk every time. Perhaps this indicates some
missing tests also.

I tracked that down to a missing case for IncrementalSortState in
ExecSetTupleBound and have updated the patch to correct the issue
(and confirmed it now properly switches sort modes).
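
A sketch of how to verify it (any bounded query over presorted input
works; here using the skewed table and index from upthread):

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t ORDER BY a, b LIMIT 10;

-- With the fix, the Incremental Sort node should report
-- "Sort Method: top-N heapsort"; previously large groups spilled and
-- showed "external merge" because the bound was never passed down.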

That also means the optimization of choosing the min group size based
on bounds (if available) wasn't previously working.

I also haven't seen incremental sort used in any parallel plans,
though there seems to be some code intended to support it. I haven't
dug into it at all yet though, so can't comment further.

I'm attaching the updated patch, and will reply separately with more
detailed comments on my current benchmarking work.

James Coleman

Attachments:

incremental-sort-28.patchapplication/octet-stream; name=incremental-sort-28.patchDownload
commit 9abae75ee93124355e5007ee9abeff1e179e78e0
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Fri May 31 14:40:17 2019 +0000

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases when the
    input is already sorted by a prefix of the sort keys.  For example when a sort
    by (key1, key2 ... keyN) is requested, and the input is already sorted by
    (key1, key2 ... keyM), M < N, we can divide the input into groups where keys
    (key1, ... keyM) are equal, and only sort on the remaining columns.
    
    Incremental sort can give a huge benefit when a LIMIT clause is
    specified, since it then doesn't even have to read the whole input.
    Another huge benefit of incremental sort is that sorting data in small
    groups may help to avoid spilling to disk during the sort.  However, on
    small datasets which fit into memory incremental sort may be slightly
    slower than full sort.  That is reflected in the costing.
    
    This patch implements very basic usage of incremental sort: it gets used
    only in create_ordered_paths(), while incremental sort could help in many
    more cases, for instance in merge joins.  But the latter would require
    many more changes in the optimizer and is postponed to future releases.
    
    Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9ba845b53a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4368,6 +4368,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..b3b519f927 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,95 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->tuplesortstate != NULL)
+	{
+		Tuplesortstate *state = (Tuplesortstate *) incrsortstate->tuplesortstate;
+		TuplesortInstrumentation stats;
+		const char *sortMethod;
+		const char *spaceType;
+		long		spaceUsed;
+
+		tuplesort_get_stats(state, &stats);
+		sortMethod = tuplesort_method_name(stats.sortMethod);
+		spaceType = tuplesort_space_type_name(stats.spaceType);
+		spaceUsed = stats.spaceUsed;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
+							 sortMethod, spaceType, spaceUsed);
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: %ld\n",
+							 incrsortstate->group_count);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Sort Groups:", NULL,
+								   incrsortstate->group_count, es);
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			TuplesortInstrumentation *sinstrument;
+			const char *sortMethod;
+			const char *spaceType;
+			long		spaceUsed;
+			int64		group_count;
+
+			sinstrument = &incrsortstate->shared_info->sinfo[n].sinstrument;
+			group_count = incrsortstate->shared_info->sinfo[n].group_count;
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
+			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
+			spaceUsed = sinstrument->spaceUsed;
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d:  Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+								 n, sortMethod, spaceType, spaceUsed, group_count);
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Sort Method", sortMethod, es);
+				ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
+				ExplainPropertyText("Sort Space Type", spaceType, es);
+				ExplainPropertyInteger("Sort Groups", NULL, group_count, es);
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 1f18e5d3a2..8680e7d911 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -559,8 +564,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 0ab9a9939c..5810ccd329 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c227282975..a9dd08fa6f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -694,6 +700,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -840,6 +850,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react properly to
+		 * changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..47fc2dda7f
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y), already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we would get
+ *		the following result, which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely that it fits into work_mem (eliminating the
+ *		need to spill to disk).  But the main advantage of incremental sort
+ *		is that it can start producing rows early, before sorting the whole
+ *		dataset, which is a significant benefit especially for queries with
+ *		LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
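As a concrete illustration of the algorithm described above, here is a minimal self-contained sketch of the same idea in plain C (illustration only, not patch code; the Tup struct and helper names are invented):

	#include <stdlib.h>

	typedef struct
	{
		int		x;			/* presorted key */
		int		y;			/* remaining sort key */
	} Tup;

	static int
	cmp_y(const void *a, const void *b)
	{
		const Tup  *ta = (const Tup *) a;
		const Tup  *tb = (const Tup *) b;

		return (ta->y > tb->y) - (ta->y < tb->y);
	}

	/*
	 * Sort an array already ordered by x into full (x, y) order, one
	 * group of equal x at a time -- the essence of incremental sort.
	 */
	static void
	incremental_sort_groups(Tup *in, int n)
	{
		int		start = 0;
		int		i;

		for (i = 1; i <= n; i++)
		{
			/* group boundary: x changed, or end of input reached */
			if (i == n || in[i].x != in[start].x)
			{
				qsort(in + start, i - start, sizeof(Tup), cmp_y);
				start = i;
			}
		}
	}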
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *tupleSlot)
+{
+	int presortedCols, i;
+	TupleTableSlot *group_pivot = node->group_pivot;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * Since the input is sorted by keys (0 ... n), the tail keys are more
+	 * likely to change.  So we do the comparison from the end, to minimize
+	 * the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(group_pivot, attno, &isnullA);
+		datumB = slot_getattr(tupleSlot, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
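For example, with presorted keys (a, b) on input sorted by (a, b), the value of b changes at almost every group boundary while a changes only occasionally, so comparing b first detects most group transitions with a single function call.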
+
+/*
+ * Sorting many small groups with tuplesort is inefficient.  In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples.  However, in the case
+ * of bounded sort where the bound is less than DEFAULT_MIN_GROUP_SIZE, we
+ * start looking for the new group as soon as the bound is exhausted.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
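For example, for a query with LIMIT 10 the first batch computes minGroupSize = Min(32, 10 - 0) = 10, so the node stops batching small groups together as soon as the ten tuples needed to satisfy the bound have been accumulated.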
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs incremental sort.  It
+ *		fetches groups of tuples where the prefix sort columns are equal
+ *		and sorts them using tuplesort.  This approach avoids sorting the
+ *		whole dataset at once.  Besides taking less memory and being
+ *		faster, it allows us to start returning tuples before the full
+ *		dataset is fetched from the outer subtree.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *tuplesortstate;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * get state info from node
+	 */
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "entering routine");
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+	/*
+	 * Return the next tuple from the currently sorted group, if available.
+	 * If there are no more tuples in the current group, we need to try
+	 * to fetch more tuples from the input and build another group.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  false, slot, NULL) || node->finished)
+			return slot;
+	}
+
+	/*
+	 * First time through or no tuples in the current group. Read next
+	 * batch of tuples from the outer plan and pass them to tuplesort.c.
+	 * Subsequent calls just fetch tuples from tuplesort, until the group
+	 * is exhausted, at which point we build the next group.
+	 */
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "sorting subplan");
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/*
+	 * Initialize tuplesort module (needed only before the first group).
+	 */
+	if (node->tuplesortstate == NULL)
+	{
+		/*
+		 * We are going to process the first group of presorted data.
+		 * Initialize the support structures used by isCurrentGroup() to
+		 * compare the presorted columns.
+		 */
+		preparePresortedCols(node);
+
+		SO1_printf("ExecIncrementalSort: %s\n",
+				   "calling tuplesort_begin_heap");
+
+		/*
+		 * Pass all the columns to tuplesort.  We pass groups of at least
+		 * minGroupSize tuples to tuplesort, so these groups don't
+		 * necessarily have equal values of the presorted columns.
+		 */
+		tuplesortstate = tuplesort_begin_heap(
+									tupDesc,
+									plannode->sort.numCols,
+									plannode->sort.sortColIdx,
+									plannode->sort.sortOperators,
+									plannode->sort.collations,
+									plannode->sort.nullsFirst,
+									work_mem,
+									NULL,
+									false);
+		node->tuplesortstate = (void *) tuplesortstate;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	}
+	node->group_count++;
+
+	/*
+	 * Calculate remaining bound for bounded sort and minimal group size
+	 * accordingly.
+	 */
+	if (node->bounded)
+	{
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
+		minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, node->bound - node->bound_Done);
+	}
+	else
+	{
+		minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+	}
+
+	/* If we got a leftover tuple from the last group, pass it to tuplesort. */
+	if (!TupIsNull(node->group_pivot))
+	{
+		tuplesort_puttupleslot(tuplesortstate, node->group_pivot);
+		ExecClearTuple(node->group_pivot);
+		nTuples++;
+	}
+
+	/*
+	 * Put next group of tuples where presortedCols sort values are equal to
+	 * tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
+
+		if (TupIsNull(slot))
+		{
+			node->finished = true;
+			break;
+		}
+
+		/*
+		 * Accumulate the next group of presorted tuples for tuplesort.
+		 * We always accumulate at least minGroupSize tuples, and only
+		 * then we start to compare the prefix keys.
+		 *
+		 * The last tuple is kept as a pivot, so that we can determine if
+		 * the subsequent tuples have the same prefix key (same group).
+		 */
+		if (nTuples < minGroupSize)
+		{
+			tuplesort_puttupleslot(tuplesortstate, slot);
+
+			/* Keep the last tuple in minimal group as a pivot. */
+			if (nTuples == minGroupSize - 1)
+				ExecCopySlot(node->group_pivot, slot);
+			nTuples++;
+		}
+		else
+		{
+			/*
+			 * Iterate while the presorted columns are the same as in the
+			 * pivot tuple.
+			 *
+			 * After accumulating at least minGroupSize tuples (we don't
+			 * know how many groups there are in that set), we need to keep
+			 * accumulating until we reach the end of the group.  Only then
+			 * can we do the sort and output all the tuples.
+			 *
+			 * We compare the prefix keys to the pivot - if the prefix keys
+			 * are the same, the tuple belongs to the same group, so we pass
+			 * it to the tuplesort.
+			 *
+			 * If the prefix differs, we've reached the end of the group.
+			 * We need to keep the last tuple, so we copy it into the pivot
+			 * slot (it's merely carried over as the leftover tuple of the
+			 * next group; it does not serve as a pivot).
+			 */
+			if (isCurrentGroup(node, slot))
+			{
+				tuplesort_puttupleslot(tuplesortstate, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+	}
+
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
+
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+	if (node->shared_info && node->am_worker)
+	{
+		TuplesortInstrumentation *si;
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+		si = &node->shared_info->sinfo[ParallelWorkerNumber].sinstrument;
+		tuplesort_get_stats(tuplesortstate, si);
+		node->shared_info->sinfo[ParallelWorkerNumber].group_count =
+															node->group_count;
+	}
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+	}
+
+	SO1_printf("ExecIncrementalSort: %s\n", "sorting done");
+
+	SO1_printf("ExecIncrementalSort: %s\n",
+			   "retrieving tuple from tuplesort");
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(tuplesortstate,
+								  ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
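A worked example of the bound accounting above: with bound = 100, if the first batch sorted nTuples = 40 without exhausting the input, bound_Done becomes Min(100, 0 + 40) = 40, so the next batch's tuplesort is bounded to 100 - 40 = 60 tuples; once node->finished is set, bound_Done is pinned at the full bound.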
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "initializing sort node");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current sort group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->tuplesortstate = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type.  No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO1_printf("ExecInitIncrementalSort: %s\n",
+			   "sort node initialized");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "shutting down sort node");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slot from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+
+	/*
+	 * Release tuplesort resources
+	 */
+	if (node->tuplesortstate != NULL)
+		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+
+	/*
+	 * shut down the subplan
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO1_printf("ExecEndIncrementalSort: %s\n",
+			   "sort node shutdown");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * We must forget previous sort results and re-read the subplan; since
+	 * incremental sort holds only the current group in the tuplesortstate,
+	 * we can't simply rewind and rescan the sorted output the way plain
+	 * Sort can.
+	 */
+	node->sort_Done = false;
+	tuplesort_end((Tuplesortstate *) node->tuplesortstate);
+	node->tuplesortstate = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..de27b06e15 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -921,6 +921,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 237598e110..83a063960f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -830,10 +830,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -843,6 +841,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..9e0d42322c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2114,12 +2114,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2128,6 +2129,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2761,6 +2788,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..3efc807164 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3884,6 +3884,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..7f820e7351 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.  The costs are
+ *	  returned via the startup_cost and run_cost output parameters.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
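To get a feel for the numbers: an in-memory sort of tuples = 1,000,000 costs about comparison_cost * 1e6 * log2(1e6), i.e. roughly 2e7 comparisons, whereas a bounded heapsort keeping only output_tuples = 100 costs about comparison_cost * 1e6 * log2(2 * 100), roughly 2.6x cheaper.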
 
+/*
+ * cost_full_sort
+ *	  Determines the cost of sorting a relation, including the cost of
+ *	  reading the input data.  The costs are returned via the startup_cost
+ *	  and run_cost output parameters.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines and returns the cost of sorting a relation incrementally,
+ *	  when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, and we're relying on quite rough assumptions
+	 * here.  Thus, to be pessimistic about incremental sort performance, we
+	 * inflate the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
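A worked example of this costing: with input_tuples = 10000 and estimate_num_groups() returning input_groups = 100, we get group_tuples = 100, and cost_tuplesort() is charged for 1.5 * 100 = 150 tuples.  With limit_tuples = 250, output_groups = floor(250 / 100) + 1 = 3, so the startup cost covers the first group (plus its share of the input cost), and the run cost covers finishing that group plus two more whole group sorts, plus the per-tuple (cpu_tuple_cost + comparison_cost) and per-group 2 * cpu_tuple_cost overheads.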
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..454c61e1d8 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -332,6 +332,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the
+ *    length of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
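For example, with keys1 = (a, b, c) and keys2 = (a, b), pathkeys_common_contained_in() sets *n_common = 2 and returns false; with keys1 = (a, b) and keys2 = (a, b, c) it likewise sets *n_common = 2 but returns true, since keys1 is fully contained in keys2.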
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1791,19 +1836,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering, because
+	 * incremental sort can handle the remaining keys.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..a02b6ee3dd 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -241,6 +243,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -255,6 +261,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -457,6 +465,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1988,6 +2001,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5050,17 +5089,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5633,9 +5679,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5649,6 +5698,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5995,6 +6075,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6729,6 +6845,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc7f4..38905501d9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4917,8 +4917,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The new paths we need to consider are an explicit full sort on the
+ * cheapest-total existing path, plus incremental sorts on any paths
+ * with presorted keys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4957,29 +4957,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
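As an example of what this buys us: for ORDER BY a, b over an input path already sorted by (a) only, presorted_keys is 1, so besides the full sort of the cheapest path we now also add an incremental sort path that sorts only by b within each group of equal a.  Paths already sorted by (a, b) are used as-is.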
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index dc11f098e0..878cb6b934 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -648,6 +648,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index efd0fbc21c..41a5e18195 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..91066b238c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2777,6 +2777,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1208eb9a68..63ce07a00c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -942,6 +942,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b8e67899e..5cd35a0221 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the
+ * allocation overhead is as small as possible.  However, we don't consider
+ * array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by the
+								   sort of any one group, either in-memory
+								   or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for
+								   on-disk space, false when it's the value
+								   for in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,16 +1250,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1293,7 +1316,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it cannot fit the data in main
+	 * memory.  This is why we consider space used on disk more important
+	 * for tracking resource usage than space used in memory.  Note that
+	 * the amount of space a tuple set occupies on disk might be less than
+	 * the amount the same tuple set occupies in memory, thanks to the more
+	 * compact on-disk representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2590,8 +2717,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2641,7 +2767,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3265,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 64122bc1e3..eccbb020b8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1922,6 +1922,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1950,6 +1964,46 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	sinstrument;
+	int64						group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from the
+								   outer node? */
+	bool		bounded_Done;	/* value of bounded we did the sort with */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	void	   *tuplesortstate; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..0500a3199f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..ebfee257f3 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1615,6 +1615,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..f9baee6495 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -762,6 +762,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9b6bdbc518..7c05d6cf71 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3f18..fe5339ea2e 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -186,6 +186,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..e7a40cec3f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -183,6 +183,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 4521de18e1..ac3377dccc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -240,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..e11fb617b5 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -19,9 +19,10 @@ Sort
 step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
 QUERY PLAN     
 
-Sort           
+Incremental Sort
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  Presorted Key: id
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..fa7fb23319
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,45 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index f23fe8d870..1226a063ef 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ca200eb599..e7e80a105c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -89,6 +89,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b64bed7e60
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,17 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
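
(For illustration only -- this is not part of the patch: a caller sorting
many similar batches would drive the new tuplesort_reset() API roughly as
in the sketch below. fetch_next_tuple(), batch_boundary() and emit_tuple()
are hypothetical stand-ins for the surrounding executor logic, "state" is
assumed to come from tuplesort_begin_heap(), and "out_slot" is a
caller-provided result slot.)

	for (;;)
	{
		TupleTableSlot *slot = fetch_next_tuple();	/* hypothetical */

		/* at end of input or a batch boundary, sort and drain the batch */
		if (slot == NULL || batch_boundary(slot))	/* hypothetical */
		{
			tuplesort_performsort(state);
			while (tuplesort_gettupleslot(state, true, false,
										  out_slot, NULL))
				emit_tuple(out_slot);				/* hypothetical */

			if (slot == NULL)
				break;

			/* keep the Tuplesortstate; drop only the per-batch data */
			tuplesort_reset(state);
		}

		tuplesort_puttupleslot(state, slot);
	}
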
#113James Coleman
jtc331@gmail.com
In reply to: James Coleman (#112)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway).

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

(From reading the whole thread through again, there's a third case:
skewed group sizes, but I'm ignoring that for now because my intuition
is that if the sort can reasonably handle the above two cases *in
execution not just planning* the skew problem will become a
non-issue.)

For (1), I'd really like to be able to find a way to still benefit
from knowing we have prefix columns already sorted. Currently the
patch doesn't do that, because the optimization for small batches
precludes knowing that all prefixes in a batch are indeed equal. As
such I've temporarily removed that optimization in my testing to see
how much of a gain we get from not needing to sort prefix columns.
Intuitively it seems plausible that this would always beat a standard
sort, since a.) equality is sometimes much cheaper than ordering, and
b.) it reduces the likelihood of spilling to disk. In addition, my
top-n heapsort fix improves things. The one thing I'm not sure
about here is the fact that I haven't seen parallel incremental sort
happen at all, and a parallel regular sort can sometimes beat single
process incremental sort (though at the cost of higher system
resources of course). As noted upthread though I haven't really
investigated the parallel component yet at all.

I did confirm a measurable speedup with suffix-only sorting (roughly
5%, relative to the current patch).
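
(To make the "equality is sometimes much cheaper than ordering" point
concrete: deciding whether a tuple still belongs to the current group is
an equality test on just the presorted columns against a saved pivot
tuple. A rough sketch using the PresortedKeyData fields the patch
introduces; the function name and the exact call plumbing here are
assumptions for illustration, not code quoted from the patch:)

	static bool
	same_prefix_group(IncrementalSortState *node, TupleTableSlot *tuple,
					  int presortedCols)
	{
		int			i;

		for (i = 0; i < presortedCols; i++)
		{
			PresortedKeyData *key = &node->presorted_keys[i];
			Datum		a,
						b,
						result;
			bool		anull,
						bnull;

			a = slot_getattr(node->group_pivot, key->attno, &anull);
			b = slot_getattr(tuple, key->attno, &bnull);

			/* for grouping purposes, NULL matches only NULL */
			if (anull || bnull)
			{
				if (anull != bnull)
					return false;
				continue;
			}

			/* fcinfo is assumed set up for the column's equality proc */
			key->fcinfo->args[0].value = a;
			key->fcinfo->args[0].isnull = false;
			key->fcinfo->args[1].value = b;
			key->fcinfo->args[1].isnull = false;
			key->fcinfo->isnull = false;
			result = FunctionCallInvoke(key->fcinfo);
			if (key->fcinfo->isnull || !DatumGetBool(result))
				return false;
		}
		return true;
	}
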

For (2), I'd love to have an optimization for the "batches are a
single tuple" case (indeed in my most common real-world use case,
that's virtually guaranteed). But that optimization is also
incompatible with the current minimum batch size optimization. There
was previously a reference to the cost of copying the tuple slot being
the problem factor with small batches, and I've confirmed there is a
performance penalty for small batches if you handle every prefix as a
separate batch (relative to the current patch), but what I plan to
investigate next is whether or not that penalty is primarily due to
the bookkeeping of frequent copying of a pivot tuple or whether it's
due to running the sort algorithm itself so frequently. If it's just
the overhead of copying the pivot tuple, then I'm wondering if adding
functionality to tuplesort to allow peeking at the first inserted
tuple might be worth it.

Alternatively I've been thinking about ways to achieve a hybrid
approach: using single-batch suffix-only sort for large batches and
multi-batch full sort for very small batches. For example, I could
imagine maintaining two different tuplesorts (one for full sort and
one for suffix-only sort) and inserting tuples into the full sorter
until the minimum batch size, and then (like now) checking every tuple
to see when we've finished the batch. If we finish the batch by, say,
2 * minimum batch size, then we perform the full sort. If we don't
find a new batch by that point, we'd go back and check to see if the
tuples we've accumulated so far are actually all the same batch. If
so, we'd move all of them to the suffix-only sorter and continue
adding rows until we hit a new batch. If not, we'd perform the full
sort, but only return tuples out of the full sorter until we encounter
the current batch, and then we'd move the remainder into the
suffix-only sorter.
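
(In rough pseudocode -- every name and threshold below is hypothetical,
just to sketch the control flow:)

	while ((slot = fetch_from_outer()) != NULL)		/* hypothetical */
	{
		if (mode == PREFIX_SORT && !same_prefix_group(pivot, slot))
		{
			/* group ended: sort the suffix columns only, start over */
			sort_and_emit(prefixsort_state);
			mode = FULL_SORT;
			n = 0;
		}

		if (mode == FULL_SORT)
		{
			tuplesort_puttupleslot(fullsort_state, slot);

			if (++n >= MIN_BATCH_SIZE && found_batch_end())
			{
				sort_and_emit(fullsort_state);		/* full sort, as now */
				n = 0;
			}
			else if (n >= 2 * MIN_BATCH_SIZE &&
					 all_share_one_prefix(fullsort_state))
			{
				/* apparently one big group: switch to suffix-only mode */
				transfer_tuples(fullsort_state, prefixsort_state);
				mode = PREFIX_SORT;
			}
		}
		else
			tuplesort_puttupleslot(prefixsort_state, slot);
	}
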

Using this kind of approach you can also imagine further optimizations
like checking the current tuple against the pivot tuple only every 10
processed tuples or so. That adds another level of complication,
obviously, but could be interesting.

Unrelated: There was some discussion upthread about whether or not
abbreviated column support could/should be supported. The comments
around tuplesort_set_bound in tuplesort.c claim that for bounded sort
abbreviated keys are not useful, and, in fact, the code there disables
abbreviated sort. Given that the single largest use case for this
patch is when we have a LIMIT, I think we can conclude not supporting
abbreviated keys is reasonable.
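
(For reference, the disabling logic in tuplesort_set_bound on master
looks roughly like this -- paraphrased from memory, so treat the details
as approximate:)

	if (state->sortKeys && state->sortKeys->abbrev_converter != NULL)
	{
		/*
		 * Fall back to the full comparator: with a bounded heap the
		 * extra cost of abbreviating keys isn't expected to pay off.
		 */
		state->sortKeys->abbrev_converter = NULL;
		state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;

		/* not strictly necessary, but be tidy */
		state->sortKeys->abbrev_abort = NULL;
		state->sortKeys->abbrev_full_comparator = NULL;
	}
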

One other (meta) question: Do we care mostly about "real" queries on
synthetic data (and comparing them to master)? Or do we care more
about comparing solely the difference in sorting cost between the two
approaches? I think some of both is necessary, but I'm curious what
you all think.

James Coleman

#114Simon Riggs
simon@2ndquadrant.com
In reply to: Rafia Sabih (#108)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, 5 Jun 2019 at 17:14, Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Regarding this, I came across this,
/*
* Incremental sort can't be used with either EXEC_FLAG_REWIND,
* EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
* bucket in tuplesortstate.
*/
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when will incremental sort be
useful for backward scans?

We stopped materializing the sort by default about 15 years ago because it
wasn't a common use case and it was very expensive for large sorts.

It's no real problem if incremental sorts don't support backwards scans -
we just won't use incremental in that case.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

#115Simon Riggs
simon@2ndquadrant.com
In reply to: James Coleman (#113)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

I think the first thing to do is get some concrete numbers on performance

if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway).

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

Do we know something about the nearly-sorted rows that could help us? Or
could we introduce some information elsewhere that would help with the sort?

Could we, for example, pre-sort the rows block by block, or filter out the
rows that are clearly out of order, so we can re-merge them later?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

#116James Coleman
jtc331@gmail.com
In reply to: Simon Riggs (#115)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway).

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

Do we know something about the nearly-sorted rows that could help us? Or could we introduce some information elsewhere that would help with the sort?

Could we, for example, pre-sort the rows block by block, or filter out the rows that are clearly out of order, so we can re-merge them later?

I'm not sure what you mean by "block by block"?

James Coleman

#117Simon Riggs
simon@2ndquadrant.com
In reply to: James Coleman (#116)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, 24 Jun 2019 at 18:01, James Coleman <jtc331@gmail.com> wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com>
wrote:

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

I was trying to think of ways of using external information/circumstance to
knowingly avoid negative use cases. i.e. don't treat sort as a black box,
use its context.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

#118Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#116)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway).

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.

Do we know something about the nearly-sorted rows that could help us?
Or could we introduce some information elsewhere that would help with
the sort?

Could we, for example, pre-sort the rows block by block, or filter out
the rows that are clearly out of order, so we can re-merge them
later?

I'm not sure what you mean by "block by block"?

I'm not sure what "block by block" means either.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#119Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#117)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 07:05:24PM +0100, Simon Riggs wrote:

On Mon, 24 Jun 2019 at 18:01, James Coleman <jtc331@gmail.com> wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com>
wrote:

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

I was trying to think of ways of using external information/circumstance to
knowingly avoid negative use cases. i.e. don't treat sort as a black box,
use its context.

Like what, for example? I'm not saying there's no such additional
information, but I can't think of anything at the moment.

One of the issues is that while the decision whether to use incremental
sort is done during planning, most of the additional (and reliable)
information about the data is only available at execution time. And it's
likely not available explicitly - we need to deduce it. Which I think is
exactly what the min group size heuristics is about, no?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#120James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#118)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

...

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.

I have a bit of a mystery in my performance testing. I've been setting
up a table like so:

create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
insert into foo(owner_fk, created_at)
select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
from generate_series(1, 10000) time_t(i)
cross join generate_series(1, 1000) fk_t(i);
-- double up on one set to guarantee matching prefixes
insert into foo (owner_fk, created_at) select owner_fk, created_at
from foo where owner_fk = 23;
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
analyze foo;

and then I have the following query:

select *
from foo
where owner_fk = 23
order by created_at desc, pk desc
limit 20000;

The idea here is to force a bit of a worst case for small groups: we
have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
then query with a limit matching the actual number of rows we know
will match the query -- so even though there's a limit we're forcing a
total sort (and also guaranteeing both plans have to touch the same
number of rows). Note: I know that batches of size is actually the
worst case, but I chose batches of two because I've also been testing
a change that would skip the sort entirely for single tuple batches.

On master (really the commit right before the current revision of the
patch), I get:
latency average = 14.271 ms
tps = 70.075243 (excluding connections establishing)

With the patch (and incremental sort enabled):
latency average = 11.975 ms
tps = 83.512090 (excluding connections establishing)

With the patch (but incremental sort disabled):
latency average = 11.884 ms
tps = 84.149834 (excluding connections establishing)

All of those are 60-second runs on pgbench with a single thread.

So we have a very substantial speedup with patch *even if the new
feature isn't enabled*. I've confirmed the plan looks the same on
patched with incremental sort disabled and master. The only changes
that would seem to really affect execution time would be the changes
to tuplesort.c, but looking through them I don't see anything I'd
expect to change things so dramatically.

Any thoughts on this?

James Coleman

#121Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#120)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

...

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.

I have a bit of a mystery in my performance testing. I've been setting
up a table like so:

create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
insert into foo(owner_fk, created_at)
select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
from generate_series(1, 10000) time_t(i)
cross join generate_series(1, 1000) fk_t(i);
-- double up on one set to guarantee matching prefixes
insert into foo (owner_fk, created_at) select owner_fk, created_at
from foo where owner_fk = 23;
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
analyze foo;

and then I have the following query:

select *
from foo
where owner_fk = 23
order by created_at desc, pk desc
limit 20000;

The idea here is to force a bit of a worst case for small groups: we
have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
then query with a limit matching the actual number of rows we know
will match the query -- so even though there's a limit we're forcing a
total sort (and also guaranteeing both plans have to touch the same
number of rows). Note: I know that batches of size one is actually the
worst case, but I chose batches of two because I've also been testing
a change that would skip the sort entirely for single tuple batches.

On master (really the commit right before the current revision of the
patch), I get:
latency average = 14.271 ms
tps = 70.075243 (excluding connections establishing)

With the patch (and incremental sort enabled):
latency average = 11.975 ms
tps = 83.512090 (excluding connections establishing)

With the patch (but incremental sort disabled):
latency average = 11.884 ms
tps = 84.149834 (excluding connections establishing)

All of those are 60-second runs on pgbench with a single thread.

So we have a very substantial speedup with patch *even if the new
feature isn't enabled*. I've confirmed the plan looks the same on
patched with incremental sort disabled and master. The only changes
that would seem to really effect execution time would be the changes
to tuplesort.c, but looking through them I don't see anything I'd
expect to change things so dramatically.

Any thoughts on this?

I can reproduce the same thing, so it's not just you. On my machine, I see
these tps numbers (average of 10 runs, 60 seconds each):

master: 65.177
patched (on): 80.368
patched (off): 80.750

The numbers are very consistent (within 1 tps).

I've done a bit of CPU profiling, and on master I see this:

13.84% postgres postgres [.] comparetup_heap
4.83% postgres postgres [.] qsort_tuple
3.87% postgres postgres [.] pg_ltostr_zeropad
3.55% postgres postgres [.] pg_ltoa
3.19% postgres postgres [.] AllocSetAlloc
2.68% postgres libc-2.28.so [.] __GI___strlen_sse2
2.38% postgres postgres [.] LWLockRelease
2.38% postgres postgres [.] AppendSeconds.constprop.9
2.22% postgres libc-2.28.so [.] __memmove_sse2_unaligned_erms
2.17% postgres postgres [.] GetPrivateRefCountEntry
2.03% postgres postgres [.] j2date
...

while on patched versions I see this:

4.60% postgres postgres [.] pg_ltostr_zeropad
4.51% postgres postgres [.] pg_ltoa
3.50% postgres postgres [.] AllocSetAlloc
3.34% postgres libc-2.28.so [.] __GI___strlen_sse2
2.99% postgres postgres [.] LWLockRelease
2.84% postgres postgres [.] AppendSeconds.constprop.9
2.65% postgres postgres [.] GetPrivateRefCountEntry
2.64% postgres postgres [.] j2date
2.60% postgres postgres [.] printtup
2.56% postgres postgres [.] heap_hot_search_buffer
...
1.35% postgres postgres [.] comparetup_heap
...

So either we're calling comparetup_heap less often, or it's cheaper.

But it seems to be very dependent on the data set. If you do this:

create table foo_2 as select * from foo order by random();
alter table foo_2 add primary key (pk);
create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);

and then run the test against this table, there's no difference.

So my guess is this particular data set triggers slightly different
behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
is quite significant (~20% on my system); the question is how widely
applicable it can be.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#122James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#121)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 12:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:

On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:

...

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.

I have a bit of a mystery in my performance testing. I've been setting
up a table like so:

create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
insert into foo(owner_fk, created_at)
select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
from generate_series(1, 10000) time_t(i)
cross join generate_series(1, 1000) fk_t(i);
-- double up on one set to guarantee matching prefixes
insert into foo (owner_fk, created_at) select owner_fk, created_at
from foo where owner_fk = 23;
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
analyze foo;

and then I have the following query:

select *
from foo
where owner_fk = 23
order by created_at desc, pk desc
limit 20000;

The idea here is to force a bit of a worst case for small groups: we
have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
then query with a limit matching the actual number of rows we know
will match the query -- so even though there's a limit we're forcing a
total sort (and also guaranteeing both plans have to touch the same
number of rows). Note: I know that batches of size one is actually the
worst case, but I chose batches of two because I've also been testing
a change that would skip the sort entirely for single tuple batches.

On master (really the commit right before the current revision of the
patch), I get:
latency average = 14.271 ms
tps = 70.075243 (excluding connections establishing)

With the patch (and incremental sort enabled):
latency average = 11.975 ms
tps = 83.512090 (excluding connections establishing)

With the patch (but incremental sort disabled):
latency average = 11.884 ms
tps = 84.149834 (excluding connections establishing)

All of those are 60-second runs on pgbench with a single thread.

So we have a very substantial speedup with patch *even if the new
feature isn't enabled*. I've confirmed the plan looks the same on
patched with incremental sort disabled and master. The only changes
that would seem to really affect execution time would be the changes
to tuplesort.c, but looking through them I don't see anything I'd
expect to change things so dramatically.

Any thoughts on this?

I can reproduce the same thing, so it's not just you. On my machine, I see
these tps numbers (average of 10 runs, 60 seconds each):

master: 65.177
patched (on): 80.368
patched (off): 80.750

The numbers are very consistent (within 1 tps).

I've done a bit of CPU profiling, and on master I see this:

13.84% postgres postgres [.] comparetup_heap
4.83% postgres postgres [.] qsort_tuple
3.87% postgres postgres [.] pg_ltostr_zeropad
3.55% postgres postgres [.] pg_ltoa
3.19% postgres postgres [.] AllocSetAlloc
2.68% postgres libc-2.28.so [.] __GI___strlen_sse2
2.38% postgres postgres [.] LWLockRelease
2.38% postgres postgres [.] AppendSeconds.constprop.9
2.22% postgres libc-2.28.so [.] __memmove_sse2_unaligned_erms
2.17% postgres postgres [.] GetPrivateRefCountEntry
2.03% postgres postgres [.] j2date
...

while on patched versions I see this:

4.60% postgres postgres [.] pg_ltostr_zeropad
4.51% postgres postgres [.] pg_ltoa
3.50% postgres postgres [.] AllocSetAlloc
3.34% postgres libc-2.28.so [.] __GI___strlen_sse2
2.99% postgres postgres [.] LWLockRelease
2.84% postgres postgres [.] AppendSeconds.constprop.9
2.65% postgres postgres [.] GetPrivateRefCountEntry
2.64% postgres postgres [.] j2date
2.60% postgres postgres [.] printtup
2.56% postgres postgres [.] heap_hot_search_buffer
...
1.35% postgres postgres [.] comparetup_heap
...

So either we're calling comparetup_heap less often, or it's cheaper.

But it seems to be very dependent on the data set. If you do this:

create table foo_2 as select * from foo order by random();
alter table foo_2 add primary key (pk);
create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);

and then run the test against this table, there's no difference.

So my guess is this particular data set triggers slightly different
behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
is quite significant (~20% on my system); the question is how widely
applicable it can be.

Thanks for confirming!

Given the patch contents I don't see any obvious reason why either
of those possibilities (calling comparetup_heap less often, or it's
cheaper) is likely. Is that something we should investigate further?
Or is it just a nice happy accident that we should ignore since it's
dataset specific?

Anyway, when evaluating the patch performance with oddities like this
would you compare performance of incremental sort on to off on the
same revision or to master when determining if it's a regression in
performance? I could make an argument either way I think.

James Coleman

#123Peter Geoghegan
pg@bowt.ie
In reply to: James Coleman (#122)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 9:53 AM James Coleman <jtc331@gmail.com> wrote:

Given the patch contents I don't see any obvious reason why either
of those possibilities (calling comparetup_heap less often, or it's
cheaper) is likely. Is that something we should investigate further?
Or is it just a nice happy accident that we should ignore since it's
dataset specific?

Have you actually confirmed that comparetup_heap() is called less
often, by instrumenting the number of individual calls specifically?
If there has been a reduction in calls to comparetup_heap(), then
that's weird, and needs to be explained. If it turns out that it isn't
actually called less often, then I would suggest that the speedup
might be related to memory fragmentation, which often matters a lot
within tuplesort.c. (This is why external sort merging now uses big
buffers, and double buffering.)

--
Peter Geoghegan

#124James Coleman
jtc331@gmail.com
In reply to: Peter Geoghegan (#123)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 1:13 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jun 25, 2019 at 9:53 AM James Coleman <jtc331@gmail.com> wrote:

Given the patch contents I don't see any obvious reason why either
of those possibilities (calling comparetup_heap less often, or it's
cheaper) is likely. Is that something we should investigate further?
Or is it just a nice happy accident that we should ignore since it's
dataset specific?

Have you actually confirmed that comparetup_heap() is called less
often, by instrumenting the number of individual calls specifically?
If there has been a reduction in calls to comparetup_heap(), then
that's weird, and needs to be explained. If it turns out that it isn't
actually called less often, then I would suggest that the speedup
might be related to memory fragmentation, which often matters a lot
within tuplesort.c. (This is why external sort merging now uses big
buffers, and double buffering.)

No, I haven't confirmed that it's called less frequently, and I'd be
extremely surprised if it were, given that the diff doesn't suggest any
changes to that at all.

If you think it's important enough to do so, I can instrument it to
confirm, but I was mostly wanting to know if there were any other
plausible explanations, and I think you've provided one: there *are*
changes in the patch to memory contexts in tuplesort.c, so if memory
fragmentation is a real concern this patch could definitely notice
changes in that regard.

James Coleman

#125Peter Geoghegan
pg@bowt.ie
In reply to: James Coleman (#124)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:

No, I haven't confirmed that it's called less frequently, and I'd be
extremely surprised if it were, given that the diff doesn't suggest any
changes to that at all.

I must have misunderstood, then. I thought that you were suggesting
that that might have happened.

If you think it's important enough to do so, I can instrument it to
confirm, but I was mostly wanting to know if there were any other
plausible explanations, and I think you've provided one: there *are*
changes in the patch to memory contexts in tuplesort.c, so if memory
fragmentation is a real concern this patch could definitely notice
changes in that regard.

Sounds like it's probably fragmentation. That's generally hard to measure.

--
Peter Geoghegan

#126Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#125)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 12:13:01PM -0700, Peter Geoghegan wrote:

On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:

No, I haven't confirmed that it's called less frequently, and I'd be
extremely surprised if it were, given that the diff doesn't suggest any
changes to that at all.

I must have misunderstood, then. I thought that you were suggesting
that that might have happened.

If you think it's important enough to do so, I can instrument it to
confirm, but I was mostly wanting to know if there were any other
plausible explanations, and I think you've provided one: there *are*
changes in the patch to memory contexts in tuplesort.c, so if memory
fragmentation is a real concern this patch could definitely notice
changes in that regard.

Sounds like it's probably fragmentation. That's generally hard to measure.

I'm not sure I'm really convinced this explains the difference, because
the changes in tuplesort.c are actually fairly small - we do split the
tuplesort context into two, but the vast majority of the stuff is
allocated in one of the contexts (essentially just the tuplesort state
gets moved to a new context). I wouldn't expect this to have such a
strong impact on locality/fragmentation.

But maybe it does - in that case it seems it might be worthwhile to do
it separately, irrespective of the incremental sort patch. I wonder if
perf would show that as cache hits/misses, or something?

It shouldn't be that difficult to separate this change into a separate
patch and benchmark it on its own, though.

FWIW while looking at the tuplesort.c changes, I've noticed some
inaccurate comments in tuplesort_free. Firstly, the top-level comment
says:

/*
 * tuplesort_free
 *
 * Internal routine for freeing resources of tuplesort.
 */

without mentioning which resources it actually releases, so it kinda
suggests it releases everything. But that's not true - AFAICS it only
releases the per-sort resources. IMO this is a poor function name, and
people will easily keep resources longer than they think - we should
rename it to something like tuplesort_free_batch().

And then at the end tuplesort_free() does this:

/*
 * Free the per-sort memory context, thereby releasing all working memory,
 * including the Tuplesortstate struct itself.
 */
MemoryContextReset(state->sortcontext);

But that's clearly not true, because the tuplesortstate is allocated in
the maincontext, not sortcontext.

In general, the comments seem to be a bit confused by what 'sort' means.
Sometimes it means the whole sort operation, sometimes it means one of
the batches, etc. And the fact that the per-batch context is called
sortcontext does not really improve the situation.
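
To make that concrete, the allocation pattern in the patch looks
roughly like this (a simplified sketch following the maincontext /
sortcontext naming from the patch, not the exact patch code):

	/*
	 * The Tuplesortstate itself lives in maincontext, so resetting
	 * sortcontext (the per-batch context) cannot release it, contrary
	 * to what the quoted comment claims.
	 */
	maincontext = AllocSetContextCreate(CurrentMemoryContext,
										"TupleSort main",
										ALLOCSET_DEFAULT_SIZES);
	sortcontext = AllocSetContextCreate(maincontext,
										"TupleSort sort",
										ALLOCSET_DEFAULT_SIZES);

	oldcontext = MemoryContextSwitchTo(maincontext);
	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
	state->maincontext = maincontext;
	state->sortcontext = sortcontext;
	MemoryContextSwitchTo(oldcontext);

	/* per-batch cleanup: frees working memory, but not the state */
	MemoryContextReset(state->sortcontext);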

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#127James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#126)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 4:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jun 25, 2019 at 12:13:01PM -0700, Peter Geoghegan wrote:

On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:

No, I haven't confirmed that it's called less frequently, and I'd be
extremely surprised if it were, given that the diff doesn't suggest any
changes to that at all.

I must have misunderstood, then. I thought that you were suggesting
that that might have happened.

If you think it's important enough to do so, I can instrument it to
confirm, but I was mostly wanting to know if there were any other
plausible explanations, and I think you've provided one: there *are*
changes in the patch to memory contexts in tuplesort.c, so if memory
fragmentation is a real concern this patch could definitely notice
changes in that regard.

Sounds like it's probably fragmentation. That's generally hard to measure.

I'm not sure I'm really convinced this explains the difference, because
the changes in tuplesort.c are actually fairly small - we do split the
tuplesort context into two, but the vast majority of the stuff is
allocated in one of the contexts (essentially just the tuplesort state
gets moved to a new context). I wouldn't expect this to have such a
strong impact on locality/fragmentation.

OTOH it is, as you noted, heavily dependent on the data... so it's hard
to say if it's a real win or not.

But maybe it does - in that case it seems it might be worthwhile to do
it separately, irrespective of the incremental sort patch. I wonder if
perf would show that as cache hits/misses, or something?

It shouldn't be that difficult to separate this change into a separate
patch and benchmark it on its own, though.

I don't know enough about perf to say, but unless this ends up being a
sticking point for the patch, I'll probably leave it aside for now,
because there are too many other things to worry about in the patch.

FWIW while looking at the tuplesort.c changes, I've noticed some
inaccurate comments in tuplesort_free. Firstly, the top-level comment
says:

/*
 * tuplesort_free
 *
 * Internal routine for freeing resources of tuplesort.
 */

without mentioning which resources it actually releases, so it kinda
suggests it releases everything. But that's not true - AFAICS it only
releases the per-sort resources. IMO this is a poor function name, and
people will easily keep resources longer than they think - we should
rename it to something like tuplesort_free_batch().

And then at the end tuplesort_free() does this:

/*
 * Free the per-sort memory context, thereby releasing all working memory,
 * including the Tuplesortstate struct itself.
 */
MemoryContextReset(state->sortcontext);

But that's clearly not true, because the tuplesortstate is allocated in
the maincontext, not sortcontext.

In general, the comments seem to be a bit confused by what 'sort' means.
Sometimes it means the whole sort operation, sometimes it means one of
the batches, etc. And the fact that the per-batch context is called
sortcontext does not really improve the situation.

There are quite a few misleading or out-of-date comments in
nodeIncrementalSort.c as well. I'm currently working on the hybrid
approach I mentioned earlier, but once the patch proper looks like
we're coming close to addressing the performance concerns/costing, I'll
do a pass through the comments to clean them up.

Unrelated: if you know someone who's more familiar with the parallel
code, I'd be interested in having them look at the patch at some point,
because I have a suspicion it might never be operating in parallel
(either that or I don't know how to trigger it), but I'm not really
familiar with that stuff at all currently. :)

James Coleman

#128Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#127)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:

Unrelated: if you know someone who's more familiar with the parallel
code, I'd be interested in having them look at the patch at some point,
because I have a suspicion it might never be operating in parallel
(either that or I don't know how to trigger it), but I'm not really
familiar with that stuff at all currently. :)

That's an interesting question. I don't think plans like this would be
very interesting:

Limit
  ->  Incremental Sort
        ->  Gather Merge
              ->  Index Scan

because most of the extra cost would be paid in the leader anyway. So
I'm not all that surprised those paths are not generated (I might be
wrong and those plans would be interesting, though).

But I think something like this would be quite beneficial:

Limit
  ->  Gather Merge
        ->  Incremental Sort
              ->  Index Scan

So I've looked into that, and the reason seems fairly simple - when
generating the Gather Merge paths, we only look at paths that are in
partial_pathlist. See generate_gather_paths().

And we only have sequential + index paths in partial_pathlist, not
incremental sort paths.

IMHO we can do two things:

1) modify generate_gather_paths to also consider incremental sort for
each sorted path, similarly to what create_ordered_paths does

2) modify build_index_paths to also generate an incremental sort path
for each index path

IMHO (1) is the right choice here, because it automatically does the
trick for all other types of ordered paths, not just index scans. So,
something like the attached patch, which gives me plans like this:

                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.86..2.85 rows=100000 width=12) (actual time=3.726..233.249 rows=100000 loops=1)
   ->  Gather Merge  (cost=0.86..120.00 rows=5999991 width=12) (actual time=3.724..156.802 rows=100000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Incremental Sort  (cost=0.84..100.00 rows=2499996 width=12) (actual time=0.563..164.438 rows=33910 loops=3)
               Sort Key: a, b
               Presorted Key: a
               Sort Method: quicksort  Memory: 27kB
               Sort Groups: 389
               Worker 0:  Sort Method: quicksort  Memory: 27kB  Groups: 1295
               Worker 1:  Sort Method: quicksort  Memory: 27kB  Groups: 1241
               ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..250612.29 rows=2499996 width=12) (actual time=0.027..128.518 rows=33926 loops=3)
 Planning Time: 68559.695 ms
 Execution Time: 285.245 ms
(14 rows)

This is not the whole story, though - there seems to be some costing
issue, because even with the parallel costs set to 0, I only get such
plans after I tweak the costs in the patch like this:

    subpath->total_cost = 100.0;
    path->path.total_cost = 120.0;

When I don't do that, the gather merge path ends up with a total cost
of 1037485, and it gets beaten by this plan:

                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.05..152.75 rows=100000 width=12) (actual time=0.234..374.492 rows=100000 loops=1)
   ->  Incremental Sort  (cost=1.05..9103.09 rows=5999991 width=12) (actual time=0.232..316.210 rows=100000 loops=1)
         Sort Key: a, b
         Presorted Key: a
         Sort Method: quicksort  Memory: 27kB
         Sort Groups: 2863
         ->  Index Scan using t_a_idx on t  (cost=0.43..285612.24 rows=5999991 width=12) (actual time=0.063..240.248 rows=100003 loops=1)
 Planning Time: 53743.858 ms
 Execution Time: 403.379 ms
(9 rows)

I suspect it's related to the fact that for the Gather Merge plan we
don't have the information about the number of rows, while for the
incremental sort we have it.

But clearly 9103.09 is not the total cost for all 6M rows the
incremental sort is expected to produce (that would have to be higher
than 285612, which is the cost of the index scan). So it seems like the
total cost of the incremental sort is ~546000, because

(100000 / 6000000) * 546000 = 9100

which is close to the 9103.09 displayed for the incremental sort. But
then

(100000.0 / 6000000) * 9103 = 152

which matches the cost of the Limit, so it seems we actually apply the
linear approximation twice. That seems pretty bogus, IMO. And indeed,
if I remove this part from cost_incremental_sort:

    if (limit_tuples > 0 && limit_tuples < input_tuples)
    {
        output_tuples = limit_tuples;
        output_groups = floor(output_tuples / group_tuples) + 1;
    }

then it behaves kinda reasonable:

explain select * from t order by a, b limit 100000;

                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Limit  (cost=1.05..9103.12 rows=100000 width=12)
   ->  Incremental Sort  (cost=1.05..546124.41 rows=5999991 width=12)
         Sort Key: a, b
         Presorted Key: a
         ->  Index Scan using t_a_idx on t  (cost=0.43..285612.24 rows=5999991 width=12)
(5 rows)

set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;

explain select * from t order by a, b limit 100000;

                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.86..6775.63 rows=100000 width=12)
   ->  Gather Merge  (cost=0.86..406486.25 rows=5999991 width=12)
         Workers Planned: 2
         ->  Incremental Sort  (cost=0.84..343937.44 rows=2499996 width=12)
               Sort Key: a, b
               Presorted Key: a
               ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..250612.29 rows=2499996 width=12)
(7 rows)

But I'm not going to claim those are complete fixes; it's the minimum I
needed to do to make this particular type of plan work.
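
As a quick sanity check of the "approximation applied twice" theory
above, the numbers can be reproduced with a throwaway snippet (the
constants are taken from the EXPLAIN output shown earlier):

#include <stdio.h>

int
main(void)
{
	double		total = 546000.0;	/* estimated incremental sort total cost */
	double		rows = 6000000.0;	/* rows the sort is expected to produce */
	double		limit = 100000.0;	/* the LIMIT */

	double		once = (limit / rows) * total;	/* ~9100, close to the 9103.09 shown */
	double		twice = (limit / rows) * once;	/* ~152, matching the Limit's 152.75 */

	printf("scaled once: %.0f, scaled twice: %.0f\n", once, twice);
	return 0;
}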

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

parallel-incremental-sort.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..15c45202f9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2719,6 +2719,8 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	{
 		Path	   *subpath = (Path *) lfirst(lc);
 		GatherMergePath *path;
+		bool		is_sorted;
+		int			presorted_keys;
 
 		if (subpath->pathkeys == NIL)
 			continue;
@@ -2727,6 +2729,33 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 		path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
 										subpath->pathkeys, NULL, rowsp);
 		add_path(rel, &path->path);
+
+		/* consider incremental sort */
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 subpath->pathkeys, &presorted_keys);
+
+		if (!is_sorted && (presorted_keys > 0))
+		{
+			elog(WARNING, "adding incremental sort + gather merge path");
+
+			/* Also consider incremental sort. */
+			subpath = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															root->sort_pathkeys,
+															presorted_keys,
+															-1);
+
+			subpath->total_cost = 100.0;
+
+			path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+											subpath->pathkeys, NULL, rowsp);
+
+			path->path.total_cost = 120.0;
+
+			add_path(rel, &path->path);
+		}
+
 	}
 }
 
#129James Coleman
jtc331@gmail.com
In reply to: James Coleman (#113)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 24, 2019 at 11:10 AM James Coleman <jtc331@gmail.com> wrote:

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

...

Alternatively I've been thinking about ways to achieve a hybrid
approach: using single-batch suffix-only sort for large batches and
multi-batch full sort for very small batches. For example, I could
imagine maintaining two different tuplesorts (one for full sort and
one for suffix-only sort) and inserting tuples into the full sorter
until the minimum batch size, and then (like now) checking every tuple
to see when we've finished the batch. If we finish the batch by, say,
2 * minimum batch size, then we perform the full sort. If we don't
find a new batch by that point, we'd go back and check to see if the
tuples we've accumulated so far are actually all the same batch. If
so, we'd move all of them to the suffix-only sorter and continue
adding rows until we hit a new batch. If not, we'd perform the full
sort, but only return tuples out of the full sorter until we encounter
the current batch, and then we'd move the remainder into the
suffix-only sorter.

Over the past week or two I've implemented this approach. The attached
patch retains the minimum group size logic already in the patch, but
then after 32 tuples it looks for a different prefix key group for only
the next 32 tuples. If it finds a new prefix key group, then everything
happens exactly as before. However, if it doesn't find a new prefix key
group within that number of tuples, then it assumes it's found a large
set of tuples in a single prefix key group. To guarantee this is the
case, it has to transfer tuples from the full-sort tuplesort into a
presorted prefix tuplesort and (leaving out a few details here about
how it handles the possibility of multiple prefix key groups already in
the sort) continue to fill up that optimized sort with the large group.

This approach should allow us to, broadly speaking, have our cake and
eat it too with respect to small and large batches of tuples in prefix
key groups. There is of course the cost of switching modes, but this
is much like the cost that bound sort pays to switch into top-n heap
sorting mode; there's an inflection point, and you pay some extra cost
right at it, but it's worth it in the broad case.

Initial testing shows promising performance: akin to the original patch
for small groups, and similar to my variant that only ever sorted by
single prefix key groups for large batches.
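
In outline, the mode-switch heuristic behaves like this standalone
simulation (a simplification, not the patch code; the constants mirror
DEFAULT_MIN_GROUP_SIZE and DEFAULT_MAX_FULL_SORT_GROUP_SIZE from the
patch, and the toy prefix keys are made up):

#include <stdio.h>

#define MIN_GROUP_SIZE				32
#define MAX_FULL_SORT_GROUP_SIZE	(2 * MIN_GROUP_SIZE)

int
main(void)
{
	int			prefix[200];
	int			pivot;
	int			count = 0;
	int			i;

	/* toy input: one large prefix key group, then a small one */
	for (i = 0; i < 200; i++)
		prefix[i] = (i < 150) ? 1 : 2;

	pivot = prefix[0];
	for (i = 0; i < 200; i++)
	{
		count++;

		/* don't even look for a group boundary before the minimum size */
		if (count > MIN_GROUP_SIZE && prefix[i] != pivot)
		{
			printf("boundary found: full sort batch of %d tuples\n",
				   count - 1);
			pivot = prefix[i];
			count = 1;
		}
		else if (count == MAX_FULL_SORT_GROUP_SIZE)
		{
			printf("no boundary after %d tuples: assume a large group and "
				   "switch to presorted prefix mode\n", count);
			break;
		}
	}
	return 0;
}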

A couple of thoughts:
- It'd be nice if tuplesort allowed us to pull back out tuples in FIFO
manner without sorting. That'd lower the inflection point cost of
switching modes.
- I haven't adjusted costing yet for this change in approach; I wanted
to take a more holistic look at that after getting this working.
- I don't have strong feelings about the group size inflection points.
More perf testing would be useful, and also it's plausible we should
do more adjusting of those heuristics based on the size of the bound,
if we have one.
- The GUC enable_sort currently also disables incremental sort, which
makes sense, but it also means there's not a good way to tweak a plan
using a full sort into a plan that uses incremental sort. That seems
not the greatest, but I'm not sure what the best solution would be.
- I did a lot of comment work in this patch, but comments for changes
to tuplesort.c and elsewhere still need to be cleaned up.

A few style notes:
- I know some of the variable declarations don't line up well; I need
to figure out how to get Vim to do what appears to be the standard
style in PG source.
- All code and comment lines are the right length, I think, with the
exception of debug printf statements. I'm not sure if long strings are
an exception.
- Lining up arguments is another thing Vim isn't set up to do, though
maybe someone has some thoughts on a good approach.
I'm planning to look at the formatting utility when I get a chance,
but it'd be nice to have the editor handle the basic setup most of the
time.

Process questions:
- Do I need to explicitly move the patch somehow to the next CF?
- Since I've basically taken over patch ownership, should I move my
name from reviewer to author in the CF app? And can there be two
authors listed there?

James Coleman

Attachments:

incremental-sort-29.patchtext/x-patch; charset=US-ASCII; name=incremental-sort-29.patchDownload
commit 12be7f7f997debe4e05e84b69c03ecf7051b1d79
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Fri May 31 14:40:17 2019 +0000

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases
    when the input is already sorted by a prefix of the sort keys. For
    example when a sort by (key1, key2 ... keyN) is requested, and the
    input is already sorted by (key1, key2 ... keyM), M < N, we can
    divide the input into groups where keys (key1, ... keyM) are equal,
    and only sort on the remaining columns.
    
    The implemented algorithm operates in two different modes:
      - Fetching a minimum number of tuples without checking prefix key
        group membership and sorting on all columns when safe.
      - Fetching all tuples for a single prefix key group and sorting on
        solely the unsorted columns.
    We always begin in the first mode, and employ a heuristic to switch
    into the second mode if we believe it's beneficial.
    
    Sorting incrementally can potentially use less memory (and possibly
    avoid spilling to disk), avoid fetching and sorting all tuples in the
    dataset (particularly useful when a LIMIT clause has been specified),
    and begin returning tuples before the entire result set is available.
    Small datasets which fit entirely in memory and must be fully realized
    and sorted may be slightly slower, which we reflect in the costing
    implementation.
    
    The hybrid mode approach allows us to optimize for both very small
    groups (where the overhead of a new tuplesort is high) and very large
    groups (where we can lower cost by not having to sort on already sorted
    columns), albeit at some extra cost while switching between modes.
    
    Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9ba845b53a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4368,6 +4368,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..f9489d8704 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 1f18e5d3a2..8680e7d911 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -559,8 +564,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53cd2fc666..bf11a08644 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c227282975..a9dd08fa6f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -694,6 +700,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -840,6 +850,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail keys
+	 * are more likely to change. Therefore we do our comparison starting from
+	 * the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the already fetched tuples are all part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/* The tuple isn't part of the current batch so we need to carry
+				 * it over into the next set of tuples we transfer out of the full
+				 * sort tuplesort into the presorted prefix tuplesort. We don't
+				 * actually have to do anything special to save the tuple since
+				 * we've already loaded it into the node->transfer_tuple slot, and,
+				 * even though that slot points to memory inside the full sort
+				 * tuplesort, we can't reset that tuplesort anyway until we've
+				 * fully transferred out of its tuples, so this reference is safe.
+				 * We do need to reset the group pivot tuple though since we've
+				 * finished the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/* Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * full sort batch sorter, so we'll sort this batch, let the caller
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
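+ *
+ *		For example (an illustrative case): with an index on "a" and the
+ *		query SELECT * FROM tbl ORDER BY a, b, the outer node returns
+ *		tuples already sorted by "a", so each prefix key group (tuples
+ *		with equal "a") only needs to be sorted by "b".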
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case exercising the
+			 * node->finished check directly, but lots of other things fail
+			 * without it: if the outer node errors out when asked to fetch
+			 * tuples past its end, things break unless this test is here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out
+			 * and we've returned those tuples to the parent node, but if
+			 * tuples remain in that tuplesort (i.e., n_fullsort_remaining > 0)
+			 * at this point we need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If bounded, calculate the number of tuples remaining and configure
+		 * both the bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * On subsequent groups we only discover that we've hit a new prefix
+		 * key group by reading one tuple past the group boundary, so we have
+		 * to carry that extra tuple over and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current prefix
+				 * key group, since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring the sort
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * we assume it's likely that we've found a large group of tuples
+			 * having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet finished fetching the current prefix key group,
+				 * because the tuples we've "lost" sort "below" the retained
+				 * ones and we're contractually guaranteed not to need more
+				 * than currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we should have a tuple to return here -- unless
+				 * the transition verified that all of those tuples belong to
+				 * the same prefix key group, in which case we go straight
+				 * back to loading tuples into that tuplesort.
+				 *
+				 * Either way, the appropriate execution status has been set
+				 * by switchToPresortedPrefixMode(), so we can drop out of the
+				 * loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, a current
+		 * group pivot tuple has already been established (it doesn't need to
+		 * be carried over; it's already been put into the prefix sort
+		 * tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to the parent plan node. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring the sort bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+	 * current prefix key group's tuples in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection
+	 * info because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slots to store the group pivot and transfer tuples */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If the subnode is to be rescanned then we forget previous sort results;
+	 * we have to re-read the subplan and re-sort.  Also must re-sort if the
+	 * bounded-sort parameters changed.  Unlike a regular sort node we never
+	 * select randomAccess, so we can't simply rewind and rescan previously
+	 * sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..de27b06e15 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -921,6 +921,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..b8c3826a17 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -830,10 +830,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -843,6 +841,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..9e0d42322c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2114,12 +2114,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2128,6 +2129,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2761,6 +2788,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..3efc807164 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3884,6 +3884,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..7f820e7351 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,183 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
 
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
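+ *
+ * For example (an illustrative sketch of the estimate): with 10000 input
+ * tuples and presorted keys estimated to form 100 groups, we cost a single
+ * tuplesort of 1.5 * (10000 / 100) = 150 tuples (the 1.5 factor is
+ * deliberate pessimism about the group size distribution) and charge that
+ * cost for each group we expect to produce.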
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		output_tuples,
+				output_groups,
+				group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, and we're relying on quite rough assumptions
+	 * here.  Thus, we're pessimistic about incremental sort performance and
+	 * inflate the average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+	{
+		output_tuples = limit_tuples;
+		output_groups = floor(output_tuples / group_tuples) + 1;
+	}
+	else
+	{
+		output_tuples = input_tuples;
+		output_groups = input_groups;
+	}
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (output_groups - 1)
+		+ group_input_run_cost * (output_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * output_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * output_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..454c61e1d8 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -332,6 +332,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
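+ *
+ *    For example, with keys1 = (a, b) and keys2 = (a, c) this sets
+ *    *n_common to 1 and returns false; with keys1 = (a) and keys2 = (a, b)
+ *    it sets *n_common to 1 and returns true, since keys1 is contained in
+ *    keys2.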
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1791,19 +1836,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use an incremental sort on them.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 12fba56285..bfb52f21ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -241,6 +243,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -255,6 +261,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -457,6 +465,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1988,6 +2001,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5050,17 +5089,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5633,9 +5679,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5649,6 +5698,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5995,6 +6075,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6729,6 +6845,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 401299e542..16996b1bc2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,8 +4922,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort
+ * and an incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4962,29 +4962,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index dc11f098e0..878cb6b934 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -648,6 +648,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index efd0fbc21c..41a5e18195 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..91066b238c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2777,6 +2777,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 631f16f5fe..e90692287b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -941,6 +941,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b8e67899e..16098ed8eb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is as small as possible.  However, we don't consider array sizes
+ * less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's a value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1222,17 +1249,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1293,7 +1322,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't manage to fit the data
+	 * in main memory.  This is why we consider space used on disk more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount occupied by the same tuples in memory,
+	 * due to a more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	thus save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2590,8 +2723,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2641,7 +2773,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index c119fdf4fa..3e48593543 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..42d5a46974 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1924,6 +1924,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData holds the information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo			flinfo;		/* comparison function info */
+	FunctionCallInfo	fcinfo;		/* comparison function call info */
+	OffsetNumber		attno;		/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1952,6 +1966,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from the outer node
+								   is finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups processed by the full sort */
+	int64		prefixsort_group_count;	/* number of groups processed by the prefix sort */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..0500a3199f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..9d45feb37b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1614,6 +1614,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..f9baee6495 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -762,6 +762,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b9d7a77e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffeef4b..61c3940921 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..e7a40cec3f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -183,6 +183,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 4521de18e1..65a73af214 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -216,6 +216,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -240,6 +241,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f045e..f5f4f7c9f9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca1012a..7afd0cc373 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
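
As a quick illustration of the new GUC (a hypothetical session against a
patched server, not part of the patch itself): disabling
enable_incrementalsort should make the planner fall back to a plain sort
for the bounded query used in the regression test above.

-- Hypothetical session; assumes the patch is applied and the regression
-- database's tenk1 table is available.
set enable_incrementalsort = off;
explain (costs off)
select * from (select * from tenk1 order by four) t order by four, ten
limit 1;
-- Expected plan shape: Limit -> Sort (Sort Key: four, ten) over the
-- inner sort, with no Incremental Sort node.
reset enable_incrementalsort;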
#130James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#128)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:

Unrelated: if you or someone else you know that's more familiar with
the parallel code, I'd be interested in their looking at the patch at
some point, because I have a suspicion it might not be operating in

...

So I've looked into that, and the reason seems fairly simple - when
generating the Gather Merge paths, we only look at paths that are in
partial_pathlist. See generate_gather_paths().

And we only have sequential + index paths in partial_pathlist, not
incremental sort paths.

IMHO we can do two things:

1) modify generate_gather_paths to also consider incremental sort for
each sorted path, similarly to what create_ordered_paths does

2) modify build_index_paths to also generate an incremental sort path
for each index path

IMHO (1) is the right choice here, because it automatically does the
trick for all other types of ordered paths, not just index scans. So,
something like the attached patch, which gives me plans like this:

...

But I'm not going to claim those are total fixes, it's the minimum I
needed to do to make this particular type of plan work.

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I apply it I noticed that the
partition_aggregate regression tests have several of these failures:

ERROR: could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

James Coleman

#131Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#130)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I apply it I noticed that the
partition_aggregate regression tests have several of these failures:

ERROR: could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

FWIW I don't claim the patch I shared is complete and/or 100% correct.
It was more an illustration of the issue and the smallest patch to make
a particular query work. The test failures are a consequence of that.

I'll try looking into the failures over the next couple of days, but I
can't promise anything.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#132James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#131)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 4, 2019 at 10:46 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

FWIW I don't claim the patch I shared is complete and/or 100% correct.
It was more an illustration of the issue and the smallest patch to make
a particular query work. The test failures are a consequence of that.

I'll try looking into the failures over the next couple of days, but I
can't promise anything.

Yep, I understand, I just wanted to note that it was still an
outstanding item and give a quick update on why so.

Anything you can look at is much appreciated.

James Coleman

#133Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#130)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I apply it I noticed that the
partition_aggregate regression tests have several of these failures:

ERROR: could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

I wanted to investigate this today, but I can't reproduce it. How are
you building and running the regression tests?

Attached is a patch adding the incremental sort below gather merge, and
also tweaking the costing. But that's mostly for better planning
decisions; I don't get any pathkey errors even with the first patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

parallel-incremental-sort-v2.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3efc807164..d7bf33f64d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2719,6 +2719,8 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	{
 		Path	   *subpath = (Path *) lfirst(lc);
 		GatherMergePath *path;
+		bool		is_sorted;
+		int			presorted_keys;
 
 		if (subpath->pathkeys == NIL)
 			continue;
@@ -2727,6 +2729,26 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 		path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
 										subpath->pathkeys, NULL, rowsp);
 		add_path(rel, &path->path);
+
+		/* consider incremental sort */
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 subpath->pathkeys, &presorted_keys);
+
+		if (!is_sorted && (presorted_keys > 0))
+		{
+			/* Also consider incremental sort. */
+			subpath = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															root->sort_pathkeys,
+															presorted_keys,
+															-1);
+
+			path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+											subpath->pathkeys, NULL, rowsp);
+
+			add_path(rel, &path->path);
+		}
 	}
 }
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7f820e7351..c6aa17ba67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1875,16 +1875,8 @@ cost_incremental_sort(Path *path,
 				   limit_tuples);
 
 	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
-	if (limit_tuples > 0 && limit_tuples < input_tuples)
-	{
-		output_tuples = limit_tuples;
-		output_groups = floor(output_tuples / group_tuples) + 1;
-	}
-	else
-	{
-		output_tuples = input_tuples;
-		output_groups = input_groups;
-	}
+	output_tuples = input_tuples;
+	output_groups = input_groups;
 
 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
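
To illustrate what the generate_gather_paths() change is aiming for (a
hypothetical example; the actual plans referenced above were elided),
a partial path that is already sorted on a prefix of the requested
ordering should now be able to get an incremental sort pushed below a
Gather Merge:

-- Hypothetical example; assumes a patched server, an index on t(a)
-- (say t_a_idx), and parallel workers available (e.g. after setting
-- min_parallel_table_scan_size = 0).  The targeted plan shape:
--   Gather Merge
--     ->  Incremental Sort
--           Sort Key: t.a, t.b
--           Presorted Key: t.a
--           ->  Parallel Index Scan using t_a_idx on t
explain (costs off)
select * from t order by a, b;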
#134James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#133)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 7, 2019 at 8:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:

On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:

Unrelated: if you or someone else you know who's more familiar with
the parallel code could look at the patch at some point, I'd be
interested, because I have a suspicion it might not be operating in

...

So I've looked into that, and the reason seems fairly simple - when
generating the Gather Merge paths, we only look at paths that are in
partial_pathlist. See generate_gather_paths().

And we only have sequential + index paths in partial_pathlist, not
incremental sort paths.

IMHO we can do two things:

1) modify generate_gather_paths to also consider incremental sort for
each sorted path, similarly to what create_ordered_paths does

2) modify build_index_paths to also generate an incremental sort path
for each index path

IMHO (1) is the right choice here, because it automatically does the
trick for all other types of ordered paths, not just index scans. So,
something like the attached patch, which gives me plans like this:

...

But I'm not going to claim those are total fixes; it's the minimum I
needed to do to make this particular type of plan work.

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I applied it I noticed that the
partition_aggregate regression tests have several of these failures:

ERROR: could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

I wanted to investigate this today, but I can't reproduce it. How are
you building and running the regression tests?

Attached is a patch adding the incremental sort below gather merge, and
also tweaking the costing. But that's mostly for better planning
decisions; I don't get any pathkey errors even with the first patch.

On 12be7f7f997debe4e05e84b69c03ecf7051b1d79 (the last patch I sent,
which is based on top of 5683b34956b4e8da9dccadc2e3a53b86104ebb33), I
did this:

patch -p1 < ~/Downloads/parallel-incremental-sort.patch
<rebuild> (FWIW I configure with ./configure
--prefix=$HOME/postgresql-test --enable-cassert --enable-debug
--enable-depend CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer
-DOPTIMIZER_DEBUG")
make check-world

And I get the attached regression failures.

James Coleman

Attachments:

regression.diffs (application/octet-stream)
diff -U3 /home/jcoleman/Source/postgres/src/test/regress/expected/partition_join.out /home/jcoleman/Source/postgres/src/test/regress/results/partition_join.out
--- /home/jcoleman/Source/postgres/src/test/regress/expected/partition_join.out	2019-06-11 20:41:11.637297274 -0400
+++ /home/jcoleman/Source/postgres/src/test/regress/results/partition_join.out	2019-07-07 09:00:20.781660472 -0400
@@ -65,6 +65,7 @@
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
                     QUERY PLAN                    
 --------------------------------------------------
  Sort
@@ -86,6 +87,7 @@
 (16 rows)
 
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
       t1      |      t2      
 --------------+--------------
  (0,0,0000)   | (0,0,0000)
@@ -208,6 +210,7 @@
 -- Currently we can't do partitioned join if nullable-side partitions are pruned
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
                         QUERY PLAN                         
 -----------------------------------------------------------
  Sort
@@ -228,6 +231,7 @@
 (15 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
   a  |  c   |  b  |  c   
 -----+------+-----+------
    0 | 0000 |     | 
@@ -566,6 +570,8 @@
 
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) LEFT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t1.b = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
                           QUERY PLAN                          
 --------------------------------------------------------------
  Sort
@@ -604,6 +610,8 @@
 (33 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) LEFT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t1.b = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
   a  |  c   |  b  |  c   | ?column? | c 
 -----+------+-----+------+----------+---
    0 | 0000 |   0 | 0000 |        0 | 0
@@ -622,6 +630,7 @@
 
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
                             QUERY PLAN                             
 -------------------------------------------------------------------
  Sort
@@ -657,6 +666,7 @@
 (30 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
   a  |  c   |  b  |  c   | ?column? | c 
 -----+------+-----+------+----------+---
    0 | 0000 |   0 | 0000 |        0 | 0
@@ -908,6 +918,7 @@
 
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
                                  QUERY PLAN                                 
 ----------------------------------------------------------------------------
  Sort
@@ -964,6 +975,7 @@
 (51 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b) RIGHT JOIN prt1_e t3 ON (t1.a = (t3.a + t3.b)/2) WHERE t3.c = 0 ORDER BY t1.a, t2.b, t3.a + t3.b;
+WARNING:  adding incremental sort + gather merge path
   a  |  c   |  b  |  c   | ?column? | c 
 -----+------+-----+------+----------+---
    0 | 0000 |   0 | 0000 |        0 | 0
@@ -984,6 +996,7 @@
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
                         QUERY PLAN                         
 -----------------------------------------------------------
  Sort
@@ -1007,6 +1020,7 @@
 (18 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+WARNING:  adding incremental sort + gather merge path
   a  |  b  
 -----+-----
    0 |    
@@ -1149,6 +1163,9 @@
 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM plt1 t1, plt2 t2, plt1_e t3 WHERE t1.b = t2.b AND t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
  GroupAggregate
@@ -1186,6 +1203,9 @@
 (32 rows)
 
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM plt1 t1, plt2 t2, plt1_e t3 WHERE t1.b = t2.b AND t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
          avg          |         avg          |          avg          |  c   |  c   |   c   
 ----------------------+----------------------+-----------------------+------+------+-------
   24.0000000000000000 |  24.0000000000000000 |   48.0000000000000000 | 0000 | 0000 | A0000
@@ -1293,6 +1313,9 @@
 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM pht1 t1, pht2 t2, pht1_e t3 WHERE t1.b = t2.b AND t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
  GroupAggregate
@@ -1330,6 +1353,9 @@
 (32 rows)
 
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM pht1 t1, pht2 t2, pht1_e t3 WHERE t1.b = t2.b AND t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
          avg          |         avg          |         avg          |  c   |  c   |   c   
 ----------------------+----------------------+----------------------+------+------+-------
   24.0000000000000000 |  24.0000000000000000 |  48.0000000000000000 | 0000 | 0000 | A0000
diff -U3 /home/jcoleman/Source/postgres/src/test/regress/expected/partition_aggregate.out /home/jcoleman/Source/postgres/src/test/regress/results/partition_aggregate.out
--- /home/jcoleman/Source/postgres/src/test/regress/expected/partition_aggregate.out	2019-07-03 22:59:05.044423362 -0400
+++ /home/jcoleman/Source/postgres/src/test/regress/results/partition_aggregate.out	2019-07-07 09:00:20.985660449 -0400
@@ -1027,142 +1027,48 @@
 -- PARTITION KEY, thus we will have a partial aggregation for them.
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
-                            QUERY PLAN                             
--------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-   ->  Append
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p1.a
-               Filter: (avg(pagg_tab_ml_p1.b) < '3'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p1
-         ->  Finalize GroupAggregate
-               Group Key: pagg_tab_ml_p2_s1.a
-               Filter: (avg(pagg_tab_ml_p2_s1.b) < '3'::numeric)
-               ->  Sort
-                     Sort Key: pagg_tab_ml_p2_s1.a
-                     ->  Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p2_s1.a
-                                 ->  Seq Scan on pagg_tab_ml_p2_s1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p2_s2.a
-                                 ->  Seq Scan on pagg_tab_ml_p2_s2
-         ->  Finalize GroupAggregate
-               Group Key: pagg_tab_ml_p3_s1.a
-               Filter: (avg(pagg_tab_ml_p3_s1.b) < '3'::numeric)
-               ->  Sort
-                     Sort Key: pagg_tab_ml_p3_s1.a
-                     ->  Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p3_s1.a
-                                 ->  Seq Scan on pagg_tab_ml_p3_s1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p3_s2.a
-                                 ->  Seq Scan on pagg_tab_ml_p3_s2
-(31 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
- a  | sum  | count 
-----+------+-------
-  0 |    0 |  1000
-  1 | 1000 |  1000
-  2 | 2000 |  1000
- 10 |    0 |  1000
- 11 | 1000 |  1000
- 12 | 2000 |  1000
- 20 |    0 |  1000
- 21 | 1000 |  1000
- 22 | 2000 |  1000
-(9 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Partial aggregation at all levels as GROUP BY clause does not match with
 -- PARTITION KEY
 EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
-                            QUERY PLAN                             
--------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_ml_p1.b
-         ->  Sort
-               Sort Key: pagg_tab_ml_p1.b
-               ->  Append
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_ml_p1.b
-                           ->  Seq Scan on pagg_tab_ml_p1
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_ml_p2_s1.b
-                           ->  Seq Scan on pagg_tab_ml_p2_s1
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_ml_p2_s2.b
-                           ->  Seq Scan on pagg_tab_ml_p2_s2
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_ml_p3_s1.b
-                           ->  Seq Scan on pagg_tab_ml_p3_s1
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_ml_p3_s2.b
-                           ->  Seq Scan on pagg_tab_ml_p3_s2
-(22 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
- b |  sum  | count 
----+-------+-------
- 0 | 30000 |  3000
- 1 | 33000 |  3000
- 2 | 36000 |  3000
- 3 | 39000 |  3000
- 4 | 42000 |  3000
-(5 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Full aggregation at all levels as GROUP BY clause matches with PARTITION KEY
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 ORDER BY 1, 2, 3;
-                                       QUERY PLAN                                       
-----------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-   ->  Append
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p1.a, pagg_tab_ml_p1.b, pagg_tab_ml_p1.c
-               Filter: (avg(pagg_tab_ml_p1.b) > '7'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p1
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p2_s1.a, pagg_tab_ml_p2_s1.b, pagg_tab_ml_p2_s1.c
-               Filter: (avg(pagg_tab_ml_p2_s1.b) > '7'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p2_s1
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p2_s2.a, pagg_tab_ml_p2_s2.b, pagg_tab_ml_p2_s2.c
-               Filter: (avg(pagg_tab_ml_p2_s2.b) > '7'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p2_s2
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p3_s1.a, pagg_tab_ml_p3_s1.b, pagg_tab_ml_p3_s1.c
-               Filter: (avg(pagg_tab_ml_p3_s1.b) > '7'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p3_s1
-         ->  HashAggregate
-               Group Key: pagg_tab_ml_p3_s2.a, pagg_tab_ml_p3_s2.b, pagg_tab_ml_p3_s2.c
-               Filter: (avg(pagg_tab_ml_p3_s2.b) > '7'::numeric)
-               ->  Seq Scan on pagg_tab_ml_p3_s2
-(23 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 ORDER BY 1, 2, 3;
- a  | sum  | count 
-----+------+-------
-  8 | 4000 |   500
-  8 | 4000 |   500
-  9 | 4500 |   500
-  9 | 4500 |   500
- 18 | 4000 |   500
- 18 | 4000 |   500
- 19 | 4500 |   500
- 19 | 4500 |   500
- 28 | 4000 |   500
- 28 | 4000 |   500
- 29 | 4500 |   500
- 29 | 4500 |   500
-(12 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Parallelism within partitionwise aggregates
 SET min_parallel_table_scan_size TO '8kB';
 SET parallel_setup_cost TO 0;
@@ -1171,156 +1077,48 @@
 -- PARTITION KEY, thus we will have a partial aggregation for them.
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
-                                    QUERY PLAN                                    
-----------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-   ->  Append
-         ->  Finalize GroupAggregate
-               Group Key: pagg_tab_ml_p1.a
-               Filter: (avg(pagg_tab_ml_p1.b) < '3'::numeric)
-               ->  Gather Merge
-                     Workers Planned: 2
-                     ->  Sort
-                           Sort Key: pagg_tab_ml_p1.a
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p1.a
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p1
-         ->  Finalize GroupAggregate
-               Group Key: pagg_tab_ml_p2_s1.a
-               Filter: (avg(pagg_tab_ml_p2_s1.b) < '3'::numeric)
-               ->  Gather Merge
-                     Workers Planned: 2
-                     ->  Sort
-                           Sort Key: pagg_tab_ml_p2_s1.a
-                           ->  Parallel Append
-                                 ->  Partial HashAggregate
-                                       Group Key: pagg_tab_ml_p2_s1.a
-                                       ->  Parallel Seq Scan on pagg_tab_ml_p2_s1
-                                 ->  Partial HashAggregate
-                                       Group Key: pagg_tab_ml_p2_s2.a
-                                       ->  Parallel Seq Scan on pagg_tab_ml_p2_s2
-         ->  Finalize GroupAggregate
-               Group Key: pagg_tab_ml_p3_s1.a
-               Filter: (avg(pagg_tab_ml_p3_s1.b) < '3'::numeric)
-               ->  Gather Merge
-                     Workers Planned: 2
-                     ->  Sort
-                           Sort Key: pagg_tab_ml_p3_s1.a
-                           ->  Parallel Append
-                                 ->  Partial HashAggregate
-                                       Group Key: pagg_tab_ml_p3_s1.a
-                                       ->  Parallel Seq Scan on pagg_tab_ml_p3_s1
-                                 ->  Partial HashAggregate
-                                       Group Key: pagg_tab_ml_p3_s2.a
-                                       ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(41 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
- a  | sum  | count 
-----+------+-------
-  0 |    0 |  1000
-  1 | 1000 |  1000
-  2 | 2000 |  1000
- 10 |    0 |  1000
- 11 | 1000 |  1000
- 12 | 2000 |  1000
- 20 |    0 |  1000
- 21 | 1000 |  1000
- 22 | 2000 |  1000
-(9 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Partial aggregation at all levels as GROUP BY clause does not match with
 -- PARTITION KEY
 EXPLAIN (COSTS OFF)
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b ORDER BY 1, 2, 3;
-                                 QUERY PLAN                                 
-----------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_ml_p1.b, (sum(pagg_tab_ml_p1.a)), (count(*))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_ml_p1.b
-         ->  Gather Merge
-               Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_ml_p1.b
-                     ->  Parallel Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p1.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p2_s1.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p2_s1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p2_s2.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p2_s2
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p3_s1.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p3_s1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_ml_p3_s2.b
-                                 ->  Parallel Seq Scan on pagg_tab_ml_p3_s2
-(24 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT b, sum(a), count(*) FROM pagg_tab_ml GROUP BY b HAVING avg(a) < 15 ORDER BY 1, 2, 3;
- b |  sum  | count 
----+-------+-------
- 0 | 30000 |  3000
- 1 | 33000 |  3000
- 2 | 36000 |  3000
- 3 | 39000 |  3000
- 4 | 42000 |  3000
-(5 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Full aggregation at all levels as GROUP BY clause matches with PARTITION KEY
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 ORDER BY 1, 2, 3;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
- Gather Merge
-   Workers Planned: 2
-   ->  Sort
-         Sort Key: pagg_tab_ml_p1.a, (sum(pagg_tab_ml_p1.b)), (count(*))
-         ->  Parallel Append
-               ->  HashAggregate
-                     Group Key: pagg_tab_ml_p1.a, pagg_tab_ml_p1.b, pagg_tab_ml_p1.c
-                     Filter: (avg(pagg_tab_ml_p1.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p1
-               ->  HashAggregate
-                     Group Key: pagg_tab_ml_p2_s1.a, pagg_tab_ml_p2_s1.b, pagg_tab_ml_p2_s1.c
-                     Filter: (avg(pagg_tab_ml_p2_s1.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p2_s1
-               ->  HashAggregate
-                     Group Key: pagg_tab_ml_p2_s2.a, pagg_tab_ml_p2_s2.b, pagg_tab_ml_p2_s2.c
-                     Filter: (avg(pagg_tab_ml_p2_s2.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p2_s2
-               ->  HashAggregate
-                     Group Key: pagg_tab_ml_p3_s1.a, pagg_tab_ml_p3_s1.b, pagg_tab_ml_p3_s1.c
-                     Filter: (avg(pagg_tab_ml_p3_s1.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p3_s1
-               ->  HashAggregate
-                     Group Key: pagg_tab_ml_p3_s2.a, pagg_tab_ml_p3_s2.b, pagg_tab_ml_p3_s2.c
-                     Filter: (avg(pagg_tab_ml_p3_s2.b) > '7'::numeric)
-                     ->  Seq Scan on pagg_tab_ml_p3_s2
-(25 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 ORDER BY 1, 2, 3;
- a  | sum  | count 
-----+------+-------
-  8 | 4000 |   500
-  8 | 4000 |   500
-  9 | 4500 |   500
-  9 | 4500 |   500
- 18 | 4000 |   500
- 18 | 4000 |   500
- 19 | 4500 |   500
- 19 | 4500 |   500
- 28 | 4000 |   500
- 28 | 4000 |   500
- 29 | 4500 |   500
- 29 | 4500 |   500
-(12 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Parallelism within partitionwise aggregates (single level)
 -- Add few parallel setup cost, so that we will see a plan which gathers
 -- partially created paths even for full aggregation and sticks a single Gather
@@ -1339,177 +1137,54 @@
 -- When GROUP BY clause matches; full aggregation is performed for each partition.
 EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_para_p1.x
-         Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
-         ->  Gather Merge
-               Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para_p1.x
-                     ->  Parallel Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p1.x
-                                 ->  Parallel Seq Scan on pagg_tab_para_p1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p2.x
-                                 ->  Parallel Seq Scan on pagg_tab_para_p2
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p3.x
-                                 ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
-
-SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
- x  | sum  |        avg         | count 
-----+------+--------------------+-------
-  0 | 5000 | 5.0000000000000000 |  1000
-  1 | 6000 | 6.0000000000000000 |  1000
- 10 | 5000 | 5.0000000000000000 |  1000
- 11 | 6000 | 6.0000000000000000 |  1000
- 20 | 5000 | 5.0000000000000000 |  1000
- 21 | 6000 | 6.0000000000000000 |  1000
-(6 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
+SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- When GROUP BY clause does not match; partial aggregation is performed for each partition.
 EXPLAIN (COSTS OFF)
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_para_p1.y, (sum(pagg_tab_para_p1.x)), (avg(pagg_tab_para_p1.x))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_para_p1.y
-         Filter: (avg(pagg_tab_para_p1.x) < '12'::numeric)
-         ->  Gather Merge
-               Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para_p1.y
-                     ->  Parallel Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p1.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p2.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p2
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_p3.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p3
-(19 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
- y  |  sum  |         avg         | count 
-----+-------+---------------------+-------
-  0 | 15000 | 10.0000000000000000 |  1500
-  1 | 16500 | 11.0000000000000000 |  1500
- 10 | 15000 | 10.0000000000000000 |  1500
- 11 | 16500 | 11.0000000000000000 |  1500
-(4 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Test when parent can produce parallel paths but not any (or some) of its children
 ALTER TABLE pagg_tab_para_p1 SET (parallel_workers = 0);
 ALTER TABLE pagg_tab_para_p3 SET (parallel_workers = 0);
 ANALYZE pagg_tab_para;
 EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_para_p1.x
-         Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
-         ->  Gather Merge
-               Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para_p1.x
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_para_p1.x
-                           ->  Parallel Append
-                                 ->  Seq Scan on pagg_tab_para_p1
-                                 ->  Seq Scan on pagg_tab_para_p3
-                                 ->  Parallel Seq Scan on pagg_tab_para_p2
-(15 rows)
-
-SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
- x  | sum  |        avg         | count 
-----+------+--------------------+-------
-  0 | 5000 | 5.0000000000000000 |  1000
-  1 | 6000 | 6.0000000000000000 |  1000
- 10 | 5000 | 5.0000000000000000 |  1000
- 11 | 6000 | 6.0000000000000000 |  1000
- 20 | 5000 | 5.0000000000000000 |  1000
- 21 | 6000 | 6.0000000000000000 |  1000
-(6 rows)
-
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
+SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
+WARNING:  adding incremental sort + gather merge path
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 ALTER TABLE pagg_tab_para_p2 SET (parallel_workers = 0);
 ANALYZE pagg_tab_para;
 EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
-   ->  Finalize GroupAggregate
-         Group Key: pagg_tab_para_p1.x
-         Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
-         ->  Gather Merge
-               Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para_p1.x
-                     ->  Partial HashAggregate
-                           Group Key: pagg_tab_para_p1.x
-                           ->  Parallel Append
-                                 ->  Seq Scan on pagg_tab_para_p1
-                                 ->  Seq Scan on pagg_tab_para_p2
-                                 ->  Seq Scan on pagg_tab_para_p3
-(15 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
- x  | sum  |        avg         | count 
-----+------+--------------------+-------
-  0 | 5000 | 5.0000000000000000 |  1000
-  1 | 6000 | 6.0000000000000000 |  1000
- 10 | 5000 | 5.0000000000000000 |  1000
- 11 | 6000 | 6.0000000000000000 |  1000
- 20 | 5000 | 5.0000000000000000 |  1000
- 21 | 6000 | 6.0000000000000000 |  1000
-(6 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 -- Reset parallelism parameters to get partitionwise aggregation plan.
 RESET min_parallel_table_scan_size;
 RESET parallel_setup_cost;
 EXPLAIN (COSTS OFF)
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: pagg_tab_para_p1.x, (sum(pagg_tab_para_p1.y)), (avg(pagg_tab_para_p1.y))
-   ->  Append
-         ->  HashAggregate
-               Group Key: pagg_tab_para_p1.x
-               Filter: (avg(pagg_tab_para_p1.y) < '7'::numeric)
-               ->  Seq Scan on pagg_tab_para_p1
-         ->  HashAggregate
-               Group Key: pagg_tab_para_p2.x
-               Filter: (avg(pagg_tab_para_p2.y) < '7'::numeric)
-               ->  Seq Scan on pagg_tab_para_p2
-         ->  HashAggregate
-               Group Key: pagg_tab_para_p3.x
-               Filter: (avg(pagg_tab_para_p3.y) < '7'::numeric)
-               ->  Seq Scan on pagg_tab_para_p3
-(15 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
 SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
- x  | sum  |        avg         | count 
-----+------+--------------------+-------
-  0 | 5000 | 5.0000000000000000 |  1000
-  1 | 6000 | 6.0000000000000000 |  1000
- 10 | 5000 | 5.0000000000000000 |  1000
- 11 | 6000 | 6.0000000000000000 |  1000
- 20 | 5000 | 5.0000000000000000 |  1000
- 21 | 6000 | 6.0000000000000000 |  1000
-(6 rows)
-
+WARNING:  adding incremental sort + gather merge path
+ERROR:  could not find pathkey item to sort
regression.out (application/octet-stream)
#135Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#134)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 07, 2019 at 09:01:43AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 8:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:

On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:

Unrelated: if you or someone else you know who's more familiar with
the parallel code could look at the patch at some point, I'd be
interested, because I have a suspicion it might not be operating in

...

So I've looked into that, and the reason seems fairly simple - when
generating the Gather Merge paths, we only look at paths that are in
partial_pathlist. See generate_gather_paths().

And we only have sequential + index paths in partial_pathlist, not
incremental sort paths.

IMHO we can do two things:

1) modify generate_gather_paths to also consider incremental sort for
each sorted path, similarly to what create_ordered_paths does

2) modify build_index_paths to also generate an incremental sort path
for each index path

IMHO (1) is the right choice here, because it automatically does the
trick for all other types of ordered paths, not just index scans. So,
something like the attached patch, which gives me plans like this:

...

But I'm not going to claim those are total fixes; it's the minimum I
needed to do to make this particular type of plan work.

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I applied it I noticed that the
partition_aggregate regression tests have several of these failures:

ERROR: could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

I wanted to investigate this today, but I can't reproduce it. How are
you building and running the regression tests?

Attached is a patch adding the incremental sort below gather merge, and
also tweaking the costing. But that's mostly for better planning
decisions; I don't get any pathkey errors even with the first patch.

On 12be7f7f997debe4e05e84b69c03ecf7051b1d79 (the last patch I sent,
which is based on top of 5683b34956b4e8da9dccadc2e3a53b86104ebb33), I
did this:

patch -p1 < ~/Downloads/parallel-incremental-sort.patch
<rebuild> (FWIW I configure with ./configure
--prefix=$HOME/postgresql-test --enable-cassert --enable-debug
--enable-depend CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer
-DOPTIMIZER_DEBUG")
make check-world

And I get the attached regression failures.

OK, thanks. Apparently it's the costing changes that make it go away; if
I try just the patch that tweaks generate_gather_paths() I see the same
failures. The failure happens during plan construction, so I think the
costing changes simply mean the paths with incremental sort end up not
being the cheapest ones (for the problematic queries), but that's just
pure luck - it's definitely an issue that needs fixing.

That error message is triggered in two places in createplan.c, and after
changing them to Assert(false) I get a core dump with this backtrace:

#0 0x0000702b3328857f in raise () from /lib64/libc.so.6
#1 0x0000702b33272895 in abort () from /lib64/libc.so.6
#2 0x0000000000a59a9d in ExceptionalCondition (conditionName=0xc52e84 "!(0)", errorType=0xc51f96 "FailedAssertion", fileName=0xc51fe6 "createplan.c", lineNumber=5937) at assert.c:54
#3 0x00000000007d4ab5 in prepare_sort_from_pathkeys (lefttree=0x2bbbce0, pathkeys=0x2b7a130, relids=0x0, reqColIdx=0x0, adjust_tlist_in_place=false, p_numsortkeys=0x7ffe1abcfd6c, p_sortColIdx=0x7ffe1abcfd60, p_sortOperators=0x7ffe1abcfd58, p_collations=0x7ffe1abcfd50,
p_nullsFirst=0x7ffe1abcfd48) at createplan.c:5937
#4 0x00000000007d4e7f in make_incrementalsort_from_pathkeys (lefttree=0x2bbbce0, pathkeys=0x2b7a130, relids=0x0, presortedCols=1) at createplan.c:6101
#5 0x00000000007cdd3f in create_incrementalsort_plan (root=0x2b787c0, best_path=0x2bb92b0, flags=1) at createplan.c:2019
#6 0x00000000007cb7ad in create_plan_recurse (root=0x2b787c0, best_path=0x2bb92b0, flags=1) at createplan.c:469
#7 0x00000000007cd778 in create_gather_merge_plan (root=0x2b787c0, best_path=0x2bb94a0) at createplan.c:1764
#8 0x00000000007cb8fb in create_plan_recurse (root=0x2b787c0, best_path=0x2bb94a0, flags=4) at createplan.c:516
#9 0x00000000007cdf10 in create_agg_plan (root=0x2b787c0, best_path=0x2bb9b28) at createplan.c:2115
#10 0x00000000007cb834 in create_plan_recurse (root=0x2b787c0, best_path=0x2bb9b28, flags=3) at createplan.c:484
#11 0x00000000007cdc16 in create_sort_plan (root=0x2b787c0, best_path=0x2bba1e8, flags=1) at createplan.c:1986
#12 0x00000000007cb78e in create_plan_recurse (root=0x2b787c0, best_path=0x2bba1e8, flags=1) at createplan.c:464
#13 0x00000000007cb4ae in create_plan (root=0x2b787c0, best_path=0x2bba1e8) at createplan.c:330
#14 0x00000000007db63c in standard_planner (parse=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at planner.c:413
#15 0x00000000007db3b4 in planner (parse=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at planner.c:275
#16 0x00000000008e404f in pg_plan_query (querytree=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x0000000000657afa in ExplainOneQuery (query=0x2bf5bc8, cursorOptions=256, into=0x0, es=0x2bf5fc0, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", params=0x0, queryEnv=0x0) at explain.c:371
#18 0x0000000000657804 in ExplainQuery (pstate=0x2a95ff8, stmt=0x2a76628, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", params=0x0, queryEnv=0x0, dest=0x2a95f60) at explain.c:259
#19 0x00000000008ec75e in standard_ProcessUtility (pstmt=0x2b8a050, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a95f60,
completionTag=0x7ffe1abd0430 "") at utility.c:675
#20 0x00000000008ebfbe in ProcessUtility (pstmt=0x2b8a050, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a95f60,
completionTag=0x7ffe1abd0430 "") at utility.c:360
#21 0x00000000008eafab in PortalRunUtility (portal=0x2adc558, pstmt=0x2b8a050, isTopLevel=true, setHoldSnapshot=true, dest=0x2a95f60, completionTag=0x7ffe1abd0430 "") at pquery.c:1175
#22 0x00000000008eacb2 in FillPortalStore (portal=0x2adc558, isTopLevel=true) at pquery.c:1035
#23 0x00000000008ea60e in PortalRun (portal=0x2adc558, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2b8a148, altdest=0x2b8a148, completionTag=0x7ffe1abd0620 "") at pquery.c:765
#24 0x00000000008e45be in exec_simple_query (query_string=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;") at postgres.c:1215
#25 0x00000000008e8912 in PostgresMain (argc=1, argv=0x2aa07c0, dbname=0x2aa0530 "regression", username=0x2a707e8 "user") at postgres.c:4249
#26 0x000000000083ed83 in BackendRun (port=0x2a99c50) at postmaster.c:4431
#27 0x000000000083e551 in BackendStartup (port=0x2a99c50) at postmaster.c:4122
#28 0x000000000083a977 in ServerLoop () at postmaster.c:1704
#29 0x000000000083a223 in PostmasterMain (argc=8, argv=0x2a6e670) at postmaster.c:1377
#30 0x000000000075a823 in main (argc=8, argv=0x2a6e670) at main.c:228

I think it's pretty obvious what's happening - I'll resist the urge
to just write "The proof is obvious and is left to the reader as
homework." because I always despised that during math lectures ;-)

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers a group aggregate with sorted
input. And (a) is a prefix of (a, sum(b), count(*)), so we think we
can actually do this, but clearly that's nonsense, because we can't
possibly know the aggregate values yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.
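
For concreteness, one way it might look - a rough sketch only, untested,
borrowing find_em_expr_for_rel() from postgres_fdw (it returns the
equivalence-class member expression usable for a given rel, or NULL if
there is none):

/*
 * Sketch: return true only if every pathkey can be evaluated for the
 * given relation, i.e. each pathkey's equivalence class has a member
 * expression computable from the rel's target.  For the failing query,
 * the pathkeys for sum(b) and count(*) would have no such member below
 * the aggregation, so we'd skip the incremental sort there.
 */
static bool
pathkeys_evaluable_for_rel(List *pathkeys, RelOptInfo *rel)
{
	ListCell   *lc;

	foreach(lc, pathkeys)
	{
		PathKey    *pathkey = (PathKey *) lfirst(lc);

		if (find_em_expr_for_rel(pathkey->pk_eclass, rel) == NULL)
			return false;
	}

	return true;
}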

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

bt.txt (text/plain; charset=us-ascii)
#136James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#135)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers a group aggregate with sorted
input. And (a) is a prefix of (a, sum(b), count(*)), so we think we
can actually do this, but clearly that's nonsense, because we can't
possibly know the aggregate values yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().
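
To make that concrete, here's roughly what I'd imagine - just a sketch
under that assumption, modeled on the calls in your patch and on how
create_ordered_paths() already adds an explicit sort to the cheapest
partial path (I haven't tried it):

	/* Hypothetical addition inside create_ordered_paths(). */
	Path	   *cheapest_partial_path;
	bool		is_sorted;
	int			presorted_keys;

	cheapest_partial_path = linitial(input_rel->partial_pathlist);

	is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
											 cheapest_partial_path->pathkeys,
											 &presorted_keys);

	if (!is_sorted && presorted_keys > 0)
	{
		Path	   *path = cheapest_partial_path;
		double		total_groups = path->rows * path->parallel_workers;

		/* Incremental sort on the presorted prefix, then gather merge. */
		path = (Path *) create_incremental_sort_path(root, ordered_rel,
													 path, root->sort_pathkeys,
													 presorted_keys,
													 limit_tuples);
		path = (Path *) create_gather_merge_path(root, ordered_rel, path,
												 path->pathtarget,
												 root->sort_pathkeys, NULL,
												 &total_groups);
		add_path(ordered_rel, path);
	}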

If I'm understanding what you're saying properly, I think you'd
expect create_ordered_paths() to be roughly similar in what it
considers as partial paths and so have the same problem. I haven't
yet read enough of the code to understand whether my proposed change
actually has any impact on the issue we're discussing, but it seems to
me that it at least fits better with what the comments imply.

I'll try to look at it a bit more later also, but at the moment other
work calls.

James Coleman

#137Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#136)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers a group aggregate with sorted
input. And (a) is a prefix of (a, sum(b), count(*)), so we think we
can actually do this, but clearly that's nonsense, because we can't
possibly know the aggregate values yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.
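
FWIW the hack is nothing fancy - something like this at the very end of
cost_incremental_sort() (a throwaway debugging tweak, certainly not
meant for any patch):

	/*
	 * Debugging only: make every incremental sort path look nearly
	 * free, so that add_path() keeps it and createplan.c is forced
	 * to build the plan from it.
	 */
	path->startup_cost = 0.0;
	path->total_cost = 0.0001;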

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

If I'm understanding what you're saying properly, I think you'd
expect create_ordered_paths() to be roughly similar in what it
considers as partial paths and so have the same problem. I haven't
yet read enough of the code to understand whether my proposed change
actually has any impact on the issue we're discussing, but it seems to
me that it at least fits better with what the comments imply.

Roughly. AFAICS the problem is that we're trying to use pathkeys that are
only valid for the (aggregated) upper relation, not before it.

I have to admit I've never quite understood this pathkeys business. I
mean, I know what pathkeys are for, but clearly it's not valid to look at
root->sort_pathkeys at this place. Maybe there's some other field we
should be looking at instead, or maybe it's ensured by grouping_planner in
some implicit way ...

I.e. the question is - when you do a query like

SELECT a, count(*) FROM t GROUP BY a ORDER BY a, count(*);

and cost the incremental sort extremely cheap, how come we don't end up
with the same issue?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#138James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#137)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers a group aggregate with sorted
input. And (a) is a prefix of (a, sum(b), count(*)), so we think we
can actually do this, but clearly that's nonsense, because we can't
possibly know the aggregate values yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood; I thought you were trying to figure out a way to
cause the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

I'm not saying it would solve the issue here; just noting that the
division of labor seemed odd to me at first read through.

James Coleman

#139Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#138)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers group aggregate with a sorted
input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
actually can do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood as you trying to figure out a way to try to cause
the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place where to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

I think those are essentially the right places where to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?
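
To make that concrete, something like this is what I have in mind - just
a sketch (not part of any attached patch); the PlannerInfo fields it
reads are the real ones, but the helper itself is hypothetical:

/*
 * Sketch only: collect the pathkey lists that might be interesting
 * when deciding whether an incremental sort could be useful here.
 */
static List *
candidate_pathkeys(PlannerInfo *root)
{
    List       *result = NIL;

    if (root->group_pathkeys != NIL)
        result = lappend(result, root->group_pathkeys);
    if (root->distinct_pathkeys != NIL)
        result = lappend(result, root->distinct_pathkeys);
    if (root->sort_pathkeys != NIL)
        result = lappend(result, root->sort_pathkeys);

    return result;
}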

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#140James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#139)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers group aggregate with a sorted
input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
actually can do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood as you trying to figure out a way to try to cause
the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place where to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

I think those are essentially the right places where to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?

I don't know enough yet to answer, but I'd like to look at (in the
debugger) the subpaths considered in each function to try to get a
better understanding of why we don't try to explicitly sort the aggs
(which we know we can't sort yet) but do for incremental sort. I
assume that means a subpath has to be present in one but not the other
since they both use the same pathkey checking function.

James Coleman

#141Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#106)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jun 3, 2019 at 12:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

For a moment I thought we could/should look at the histogram, because that
could tell us if there are groups "before" the first MCV one, but I don't
think we should do that, for two reasons. Firstly, rare values may not get
to the histogram anyway, which makes this rather unreliable and might
introduce sudden plan changes, because the cost would vary wildly
depending on whether we happened to sample the rare row or not. And
secondly, the rare row may be easily filtered out by a WHERE condition or
something, at which point we'll have to deal with the large group anyway.

If the first MCV is in the middle of the first histogram bin, then it's
reasonable to think that it would fit into the first group. But if the
first MCV is in the middle of the histogram, such an assumption would be
ridiculous. Also, I'd like to note that with our
default_statistics_target == 100, non-MCV values are not so rare. So,
I'm +1 for taking the histogram into account.
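
For illustration, here's a rough sketch (hypothetical, not from the
patch) of how the first MCV could be compared against the first
histogram bin using the standard stats APIs; vardata, ltproc, collation
and first_mcv are assumed to be set up by the surrounding estimation
code:

/*
 * Sketch only: does the first MCV sort below the second histogram
 * boundary, i.e. does it fall within the first histogram bin?
 */
AttStatsSlot hslot;

if (get_attstatsslot(&hslot, vardata->statsTuple,
                     STATISTIC_KIND_HISTOGRAM, InvalidOid,
                     ATTSTATSSLOT_VALUES))
{
    if (hslot.nvalues >= 2 &&
        DatumGetBool(FunctionCall2Coll(&ltproc, collation,
                                       first_mcv, hslot.values[1])))
    {
        /* the first MCV lies inside the first histogram bin */
    }

    free_attstatsslot(&hslot);
}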

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#142Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: James Coleman (#129)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 4, 2019 at 4:25 PM James Coleman <jtc331@gmail.com> wrote:

Process questions:
- Do I need to explicitly move the patch somehow to the next CF?

We didn't manage to register it in the current (July) commitfest. So,
please register it in the next (September) commitfest.

- Since I've basically taken over patch ownership, should I move my
name from reviewer to author in the CF app? And can there be two
authors listed there?

Sure, you're a co-author of this patch. Two or more authors can be
listed in the CF app; you can find a lot of examples there.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#143Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#140)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running a query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers group aggregate with a sorted
input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
actually can do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood as you trying to figure out a way to try to cause
the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place where to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

I think those are essentially the right places where to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding an explicit
sort (using create_sort_path) and also consider adding an incremental sort.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that, it only preserves existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add one. And once we have the gather merge
in place, we can't push a sort "under" it.

In fact, we already have code dealing with this "issue" for a special
case - see gather_grouping_paths(). It generates plain gather merge
paths, but then also considers building one with explicit sort. But it
only does that for grouping paths (when it's clear we need to be looking
at group_pathkeys), and there are other generate_gather_paths() callers
that don't get similar treatment.

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?

I don't know enough yet to answer, but I'd like to look at (in the
debugger) the subpaths considered in each function to try to get a
better understanding of why we don't try to explicitly sort the aggs
(which we know we can't sort yet) but do for incremental sort. I
assume that means a subpath has to be present in one but not the other
since they both use the same pathkey checking function.

I've been wondering if we have some other code that needs to consider
"candidate" interesting pathkeys (instead of just knowing the one list
that is interesting in that place). Because then we could look at that
code and use it here ...

And guess what - postgres_fdw needs to do pretty much exactly that, when
building paths for remote relations. AFAIK we can't easily request all
plans from the remote node and then look at their pathkeys (like we'd do
with a local node), so instead we deduce "interesting" pathkeys and then
request the best plans for those. And deducing "interesting" pathkeys is
pretty much what get_useful_pathkeys_for_relation() is about.

So I've copied this function (and two more, called from it), whacked it
a bit until it compiled (Shakespeare-writing chimp comes to mind) and
voila, it seems to be working. The errors you reported are gone, and the
plans seem to be reasonable.

Attached is a sequence of 4 patches:

0001-fix-pathkey-processing-in-generate_gather_paths.patch
----------------------------------------------------------
This is the fixed version of my previous patch, with the stuff stolen
from postgres_fdw.

0002-fix-costing-in-cost_incremental_sort.patch
-----------------------------------------------
This is the costing fix, I mentioned before.

0003-fix-explain-in-parallel-mode.patch
---------------------------------------
A minor bug in EXPLAIN, when the incremental sort ends up in the
parallel part of the plan (a missing newline on the per-worker line).

0004-rework-where-incremental-sort-paths-are-created.patch
----------------------------------------------------------
This undoes the generate_gather_paths() changes from 0001, and instead
modifies a bunch of places that call create_sort_path() to also consider
incremental sorts. There are a couple remaining, but those should not be
relevant to the queries I've been looking at.

Essentially, 0002 and 0003 are bugfixes. 0001 and 0004 are the two
different approaches to building incremental sort + gather merge.

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With the 0001+0002+0003 patches, I get a plan like this:

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the
first one. That means 0004 does not even generate the first plan, i.e.
there are cases where we don't try to add the explicit sort before
passing the path to generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

FWIW tweaking all the create_sort_path() places to also consider adding
an incremental sort is a bit tedious and invasive, and it almost doubles
the amount of repetitive code. It's OK for an experiment like this, but
we should try handling this in a nicer way (moving it to a separate
function that does both, or something like that).
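
Something like this is the kind of helper I mean - just a sketch, not in
the attached patches, and the name is made up:

/*
 * Sketch only: build a sort path for the given pathkeys, switching to
 * an incremental sort when the subpath is already sorted on a prefix.
 */
static Path *
create_sort_path_maybe_incremental(PlannerInfo *root, RelOptInfo *rel,
                                   Path *subpath, List *pathkeys,
                                   double limit_tuples)
{
    bool        is_sorted;
    int         presorted_keys;

    is_sorted = pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
                                             &presorted_keys);

    if (is_sorted)
        return subpath;         /* already sorted, nothing to add */

    if (presorted_keys > 0)
        return (Path *) create_incremental_sort_path(root, rel, subpath,
                                                     pathkeys,
                                                     presorted_keys,
                                                     limit_tuples);

    return (Path *) create_sort_path(root, rel, subpath, pathkeys,
                                     limit_tuples);
}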

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-fix-pathkey-processing-in-generate_gather_paths.patchtext/plain; charset=us-asciiDownload
From 56fc55058ca44f18a4a3c878a5588b01c67df0e0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:12:45 +0200
Subject: [PATCH 1/4] fix pathkey processing in generate_gather_paths

---
 src/backend/optimizer/path/allpaths.c   | 269 ++++++++++++++++++++++++
 src/backend/optimizer/plan/createplan.c |  10 +-
 2 files changed, 277 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3efc807164..34a0fb4d32 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2665,6 +2665,242 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	add_path(rel, create_worktablescan_path(root, rel, required_outer));
 }
 
+
+
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_ecs_for_relation
+ *		Determine which EquivalenceClasses might be involved in useful
+ *		orderings of this relation.
+ *
+ * This function is in some respects a mirror image of the core function
+ * pathkeys_useful_for_merging: for a regular table, we know what indexes
+ * we have and want to test whether any of them are useful.  For a foreign
+ * table, we don't know what indexes are present on the remote side but
+ * want to speculate about which ones we'd like to use if they existed.
+ *
+ * This function returns a list of potentially-useful equivalence classes,
+ * but it does not guarantee that an EquivalenceMember exists which contains
+ * Vars only from the given relation.  For example, given ft1 JOIN t1 ON
+ * ft1.x + t1.x = 0, this function will say that the equivalence class
+ * containing ft1.x + t1.x is potentially useful.  Supposing ft1 is remote and
+ * t1 is local (or on a different server), it will turn out that no useful
+ * ORDER BY clause can be generated.  It's not our job to figure that out
+ * here; we're only interested in identifying relevant ECs.
+ */
+static List *
+get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_eclass_list = NIL;
+	ListCell   *lc;
+	Relids		relids;
+
+	/*
+	 * First, consider whether any active EC is potentially useful for a merge
+	 * join against this relation.
+	 */
+	if (rel->has_eclass_joins)
+	{
+		foreach(lc, root->eq_classes)
+		{
+			EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc);
+
+			if (eclass_useful_for_merging(root, cur_ec, rel))
+				useful_eclass_list = lappend(useful_eclass_list, cur_ec);
+		}
+	}
+
+	/*
+	 * Next, consider whether there are any non-EC derivable join clauses that
+	 * are merge-joinable.  If the joininfo list is empty, we can exit
+	 * quickly.
+	 */
+	if (rel->joininfo == NIL)
+		return useful_eclass_list;
+
+	/* If this is a child rel, we must use the topmost parent rel to search. */
+	if (IS_OTHER_REL(rel))
+	{
+		Assert(!bms_is_empty(rel->top_parent_relids));
+		relids = rel->top_parent_relids;
+	}
+	else
+		relids = rel->relids;
+
+	/* Check each join clause in turn. */
+	foreach(lc, rel->joininfo)
+	{
+		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Consider only mergejoinable clauses */
+		if (restrictinfo->mergeopfamilies == NIL)
+			continue;
+
+		/* Make sure we've got canonical ECs. */
+		update_mergeclause_eclasses(root, restrictinfo);
+
+		/*
+		 * restrictinfo->mergeopfamilies != NIL is sufficient to guarantee
+		 * that left_ec and right_ec will be initialized, per comments in
+		 * distribute_qual_to_rels.
+		 *
+		 * We want to identify which side of this merge-joinable clause
+		 * contains columns from the relation produced by this RelOptInfo. We
+		 * test for overlap, not containment, because there could be extra
+		 * relations on either side.  For example, suppose we've got something
+		 * like ((A JOIN B ON A.x = B.x) JOIN C ON A.y = C.y) LEFT JOIN D ON
+		 * A.y = D.y.  The input rel might be the joinrel between A and B, and
+		 * we'll consider the join clause A.y = D.y. relids contains a
+		 * relation not involved in the join class (B) and the equivalence
+		 * class for the left-hand side of the clause contains a relation not
+		 * involved in the input rel (C).  Despite the fact that we have only
+		 * overlap and not containment in either direction, A.y is potentially
+		 * useful as a sort column.
+		 *
+		 * Note that it's even possible that relids overlaps neither side of
+		 * the join clause.  For example, consider A LEFT JOIN B ON A.x = B.x
+		 * AND A.x = 1.  The clause A.x = 1 will appear in B's joininfo list,
+		 * but overlaps neither side of B.  In that case, we just skip this
+		 * join clause, since it doesn't suggest a useful sort order for this
+		 * relation.
+		 */
+		if (bms_overlap(relids, restrictinfo->right_ec->ec_relids))
+			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
+														restrictinfo->right_ec);
+		else if (bms_overlap(relids, restrictinfo->left_ec->ec_relids))
+			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
+														restrictinfo->left_ec);
+	}
+
+	return useful_eclass_list;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	List	   *useful_eclass_list;
+	EquivalenceClass *query_ec = NULL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	/* Get the list of interesting EquivalenceClasses. */
+	useful_eclass_list = get_useful_ecs_for_relation(root, rel);
+
+	/* Extract unique EC for query, if any, so we don't consider it again. */
+	if (list_length(root->query_pathkeys) == 1)
+	{
+		PathKey    *query_pathkey = linitial(root->query_pathkeys);
+
+		query_ec = query_pathkey->pk_eclass;
+	}
+
+	/*
+	 * As a heuristic, the only pathkeys we consider here are those of length
+	 * one.  It's surely possible to consider more, but since each one we
+	 * choose to consider will generate a round-trip to the remote side, we
+	 * need to be a bit cautious here.  It would sure be nice to have a local
+	 * cache of information about remote index definitions...
+	 */
+	foreach(lc, useful_eclass_list)
+	{
+		EquivalenceClass *cur_ec = lfirst(lc);
+		Expr	   *em_expr;
+		PathKey    *pathkey;
+
+		/* If redundant with what we did above, skip it. */
+		if (cur_ec == query_ec)
+			continue;
+
+		/* If no pushable expression for this rel, skip it. */
+		em_expr = find_em_expr_for_rel(cur_ec, rel);
+		if (em_expr == NULL)
+			continue;
+
+		/* Looks like we can generate a pathkey, so let's do it. */
+		pathkey = make_canonical_pathkey(root, cur_ec,
+										 linitial_oid(cur_ec->ec_opfamilies),
+										 BTLessStrategyNumber,
+										 false);
+		useful_pathkeys_list = lappend(useful_pathkeys_list,
+									   list_make1(pathkey));
+	}
+
+	return useful_pathkeys_list;
+}
+
 /*
  * generate_gather_paths
  *		Generate parallel access paths for a relation by pushing a Gather or
@@ -2719,6 +2955,10 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	{
 		Path	   *subpath = (Path *) lfirst(lc);
 		GatherMergePath *path;
+		bool		is_sorted;
+		int			presorted_keys;
+		List	   *useful_pathkeys_list = NIL; /* List of all pathkeys */
+		ListCell   *lc;
 
 		if (subpath->pathkeys == NIL)
 			continue;
@@ -2727,6 +2967,35 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 		path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
 										subpath->pathkeys, NULL, rowsp);
 		add_path(rel, &path->path);
+
+		/* consider incremental sort for interesting orderings */
+		useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+		foreach(lc, useful_pathkeys_list)
+		{
+			List	   *useful_pathkeys = lfirst(lc);
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (!is_sorted && (presorted_keys > 0))
+			{
+				/* Also consider incremental sort. */
+				subpath = (Path *) create_incremental_sort_path(root,
+																rel,
+																subpath,
+																useful_pathkeys,
+																presorted_keys,
+																-1);
+
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+
 	}
 }
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bfb52f21ab..c2877942cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -5932,7 +5932,10 @@ prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 				}
 			}
 			if (!j)
-				elog(ERROR, "could not find pathkey item to sort");
+			{
+				elog(WARNING, "could not find pathkey item to sort");
+				Assert(false);
+			}
 
 			/*
 			 * Do we need to insert a Result node?
@@ -6491,7 +6494,10 @@ make_unique_from_pathkeys(Plan *lefttree, List *pathkeys, int numCols)
 		}
 
 		if (!tle)
-			elog(ERROR, "could not find pathkey item to sort");
+		{
+			elog(WARNING, "could not find pathkey item to sort");
+			Assert(false);
+		}
 
 		/*
 		 * Look up the correct equality operator from the PathKey's slightly
-- 
2.20.1

0002-fix-costing-in-cost_incremental_sort.patchtext/plain; charset=us-asciiDownload
From e6894cf4dd0c7c6db0fe11089fff4cf05794b97f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:13:04 +0200
Subject: [PATCH 2/4] fix costing in cost_incremental_sort

---
 src/backend/optimizer/path/costsize.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7f820e7351..c6aa17ba67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1875,16 +1875,8 @@ cost_incremental_sort(Path *path,
 				   limit_tuples);
 
 	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
-	if (limit_tuples > 0 && limit_tuples < input_tuples)
-	{
-		output_tuples = limit_tuples;
-		output_groups = floor(output_tuples / group_tuples) + 1;
-	}
-	else
-	{
-		output_tuples = input_tuples;
-		output_groups = input_groups;
-	}
+	output_tuples = input_tuples;
+	output_groups = input_groups;
 
 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
-- 
2.20.1

0003-fix-explain-in-parallel-mode.patchtext/plain; charset=us-asciiDownload
From 3ab9a075e7569abda2cf717e06e27d0ec258b59c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:13:34 +0200
Subject: [PATCH 3/4] fix explain in parallel mode

---
 src/backend/commands/explain.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d3f855a12a..925e8236ba 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2775,7 +2775,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 								 fullsort_spaceUsed, fullsort_group_count);
 				if (prefixsort_instrument)
 					appendStringInfo(es->str,
-									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld",
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
 									 prefixsort_sortMethod, prefixsort_spaceType,
 									 prefixsort_spaceUsed, prefixsort_group_count);
 				else
-- 
2.20.1

0004-rework-where-incremental-sort-paths-are-created.patchtext/plain; charset=us-asciiDownload
From 091627c63cfb7ab47bfb76f6a96f94370aeea28d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 02:14:18 +0200
Subject: [PATCH 4/4] rework where incremental sort paths are created

---
 src/backend/optimizer/path/allpaths.c | 269 -----------------------
 src/backend/optimizer/plan/planner.c  | 299 ++++++++++++++++++++++++++
 2 files changed, 299 insertions(+), 269 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 34a0fb4d32..3efc807164 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2665,242 +2665,6 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	add_path(rel, create_worktablescan_path(root, rel, required_outer));
 }
 
-
-
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-static Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
-/*
- * get_useful_ecs_for_relation
- *		Determine which EquivalenceClasses might be involved in useful
- *		orderings of this relation.
- *
- * This function is in some respects a mirror image of the core function
- * pathkeys_useful_for_merging: for a regular table, we know what indexes
- * we have and want to test whether any of them are useful.  For a foreign
- * table, we don't know what indexes are present on the remote side but
- * want to speculate about which ones we'd like to use if they existed.
- *
- * This function returns a list of potentially-useful equivalence classes,
- * but it does not guarantee that an EquivalenceMember exists which contains
- * Vars only from the given relation.  For example, given ft1 JOIN t1 ON
- * ft1.x + t1.x = 0, this function will say that the equivalence class
- * containing ft1.x + t1.x is potentially useful.  Supposing ft1 is remote and
- * t1 is local (or on a different server), it will turn out that no useful
- * ORDER BY clause can be generated.  It's not our job to figure that out
- * here; we're only interested in identifying relevant ECs.
- */
-static List *
-get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
-{
-	List	   *useful_eclass_list = NIL;
-	ListCell   *lc;
-	Relids		relids;
-
-	/*
-	 * First, consider whether any active EC is potentially useful for a merge
-	 * join against this relation.
-	 */
-	if (rel->has_eclass_joins)
-	{
-		foreach(lc, root->eq_classes)
-		{
-			EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc);
-
-			if (eclass_useful_for_merging(root, cur_ec, rel))
-				useful_eclass_list = lappend(useful_eclass_list, cur_ec);
-		}
-	}
-
-	/*
-	 * Next, consider whether there are any non-EC derivable join clauses that
-	 * are merge-joinable.  If the joininfo list is empty, we can exit
-	 * quickly.
-	 */
-	if (rel->joininfo == NIL)
-		return useful_eclass_list;
-
-	/* If this is a child rel, we must use the topmost parent rel to search. */
-	if (IS_OTHER_REL(rel))
-	{
-		Assert(!bms_is_empty(rel->top_parent_relids));
-		relids = rel->top_parent_relids;
-	}
-	else
-		relids = rel->relids;
-
-	/* Check each join clause in turn. */
-	foreach(lc, rel->joininfo)
-	{
-		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
-
-		/* Consider only mergejoinable clauses */
-		if (restrictinfo->mergeopfamilies == NIL)
-			continue;
-
-		/* Make sure we've got canonical ECs. */
-		update_mergeclause_eclasses(root, restrictinfo);
-
-		/*
-		 * restrictinfo->mergeopfamilies != NIL is sufficient to guarantee
-		 * that left_ec and right_ec will be initialized, per comments in
-		 * distribute_qual_to_rels.
-		 *
-		 * We want to identify which side of this merge-joinable clause
-		 * contains columns from the relation produced by this RelOptInfo. We
-		 * test for overlap, not containment, because there could be extra
-		 * relations on either side.  For example, suppose we've got something
-		 * like ((A JOIN B ON A.x = B.x) JOIN C ON A.y = C.y) LEFT JOIN D ON
-		 * A.y = D.y.  The input rel might be the joinrel between A and B, and
-		 * we'll consider the join clause A.y = D.y. relids contains a
-		 * relation not involved in the join class (B) and the equivalence
-		 * class for the left-hand side of the clause contains a relation not
-		 * involved in the input rel (C).  Despite the fact that we have only
-		 * overlap and not containment in either direction, A.y is potentially
-		 * useful as a sort column.
-		 *
-		 * Note that it's even possible that relids overlaps neither side of
-		 * the join clause.  For example, consider A LEFT JOIN B ON A.x = B.x
-		 * AND A.x = 1.  The clause A.x = 1 will appear in B's joininfo list,
-		 * but overlaps neither side of B.  In that case, we just skip this
-		 * join clause, since it doesn't suggest a useful sort order for this
-		 * relation.
-		 */
-		if (bms_overlap(relids, restrictinfo->right_ec->ec_relids))
-			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
-														restrictinfo->right_ec);
-		else if (bms_overlap(relids, restrictinfo->left_ec->ec_relids))
-			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
-														restrictinfo->left_ec);
-	}
-
-	return useful_eclass_list;
-}
-
-/*
- * get_useful_pathkeys_for_relation
- *		Determine which orderings of a relation might be useful.
- *
- * Getting data in sorted order can be useful either because the requested
- * order matches the final output ordering for the overall query we're
- * planning, or because it enables an efficient merge join.  Here, we try
- * to figure out which pathkeys to consider.
- */
-static List *
-get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
-{
-	List	   *useful_pathkeys_list = NIL;
-	List	   *useful_eclass_list;
-	EquivalenceClass *query_ec = NULL;
-	ListCell   *lc;
-
-	/*
-	 * Pushing the query_pathkeys to the remote server is always worth
-	 * considering, because it might let us avoid a local sort.
-	 */
-	if (root->query_pathkeys)
-	{
-		bool		query_pathkeys_ok = true;
-
-		foreach(lc, root->query_pathkeys)
-		{
-			PathKey    *pathkey = (PathKey *) lfirst(lc);
-			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-			Expr	   *em_expr;
-
-			/*
-			 * The planner and executor don't have any clever strategy for
-			 * taking data sorted by a prefix of the query's pathkeys and
-			 * getting it to be sorted by all of those pathkeys. We'll just
-			 * end up resorting the entire data set.  So, unless we can push
-			 * down all of the query pathkeys, forget it.
-			 *
-			 * is_foreign_expr would detect volatile expressions as well, but
-			 * checking ec_has_volatile here saves some cycles.
-			 */
-			if (pathkey_ec->ec_has_volatile ||
-				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
-			{
-				query_pathkeys_ok = false;
-				break;
-			}
-		}
-
-		if (query_pathkeys_ok)
-			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
-	}
-
-	/* Get the list of interesting EquivalenceClasses. */
-	useful_eclass_list = get_useful_ecs_for_relation(root, rel);
-
-	/* Extract unique EC for query, if any, so we don't consider it again. */
-	if (list_length(root->query_pathkeys) == 1)
-	{
-		PathKey    *query_pathkey = linitial(root->query_pathkeys);
-
-		query_ec = query_pathkey->pk_eclass;
-	}
-
-	/*
-	 * As a heuristic, the only pathkeys we consider here are those of length
-	 * one.  It's surely possible to consider more, but since each one we
-	 * choose to consider will generate a round-trip to the remote side, we
-	 * need to be a bit cautious here.  It would sure be nice to have a local
-	 * cache of information about remote index definitions...
-	 */
-	foreach(lc, useful_eclass_list)
-	{
-		EquivalenceClass *cur_ec = lfirst(lc);
-		Expr	   *em_expr;
-		PathKey    *pathkey;
-
-		/* If redundant with what we did above, skip it. */
-		if (cur_ec == query_ec)
-			continue;
-
-		/* If no pushable expression for this rel, skip it. */
-		em_expr = find_em_expr_for_rel(cur_ec, rel);
-		if (em_expr == NULL)
-			continue;
-
-		/* Looks like we can generate a pathkey, so let's do it. */
-		pathkey = make_canonical_pathkey(root, cur_ec,
-										 linitial_oid(cur_ec->ec_opfamilies),
-										 BTLessStrategyNumber,
-										 false);
-		useful_pathkeys_list = lappend(useful_pathkeys_list,
-									   list_make1(pathkey));
-	}
-
-	return useful_pathkeys_list;
-}
-
 /*
  * generate_gather_paths
  *		Generate parallel access paths for a relation by pushing a Gather or
@@ -2955,10 +2719,6 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	{
 		Path	   *subpath = (Path *) lfirst(lc);
 		GatherMergePath *path;
-		bool		is_sorted;
-		int			presorted_keys;
-		List	   *useful_pathkeys_list = NIL; /* List of all pathkeys */
-		ListCell   *lc;
 
 		if (subpath->pathkeys == NIL)
 			continue;
@@ -2967,35 +2727,6 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 		path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
 										subpath->pathkeys, NULL, rowsp);
 		add_path(rel, &path->path);
-
-		/* consider incremental sort for interesting orderings */
-		useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
-
-		foreach(lc, useful_pathkeys_list)
-		{
-			List	   *useful_pathkeys = lfirst(lc);
-
-			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
-													 subpath->pathkeys,
-													 &presorted_keys);
-
-			if (!is_sorted && (presorted_keys > 0))
-			{
-				/* Also consider incremental sort. */
-				subpath = (Path *) create_incremental_sort_path(root,
-																rel,
-																subpath,
-																useful_pathkeys,
-																presorted_keys,
-																-1);
-
-				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
-												subpath->pathkeys, NULL, rowsp);
-
-				add_path(rel, &path->path);
-			}
-		}
-
 	}
 }
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 16996b1bc2..ecad427c40 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5068,6 +5068,48 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/* also consider incremental sorts on all partial paths */
+		{
+			ListCell *lc;
+			foreach (lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				/* already handled above */
+				if (input_path == cheapest_partial_path)
+					continue;
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys, &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys > 0)
+				{
+					/* Also consider incremental sort. */
+					sorted_path = (Path *) create_incremental_sort_path(root,
+																		ordered_rel,
+																		input_path,
+																		root->sort_pathkeys,
+																		presorted_keys,
+																		limit_tuples);
+
+					/* Add projection step if needed */
+					if (sorted_path->pathtarget != target)
+						sorted_path = apply_projection_to_path(root, ordered_rel,
+															   sorted_path, target);
+
+					add_path(ordered_rel, sorted_path);
+				}
+			}
+
+		}
 	}
 
 	/*
@@ -6484,6 +6526,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			}
 		}
 
+
+		/*
+		 * Use any available suitably-sorted path as input, with incremental
+		 * sort path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
+		}
+
 		/*
 		 * Instead of operating directly on the input relation, we can
 		 * consider finalizing a partially aggregated path.
@@ -6530,6 +6646,53 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   havingQual,
 											   dNumGroups));
 			}
+
+			/* incremental sort */
+			foreach(lc, partially_grouped_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
+			}
+
 		}
 	}
 
@@ -6798,6 +6961,57 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Use any available suitably-sorted path as input, and also consider
+		 * sorting the cheapest partial path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* also ignore already sorted paths */
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			/* add incremental sort */
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_path(partially_grouped_rel, (Path *)
+						 create_agg_path(root,
+										 partially_grouped_rel,
+										 path,
+										 partially_grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_INITIAL_SERIAL,
+										 parse->groupClause,
+										 NIL,
+										 agg_partial_costs,
+										 dNumPartialGroups));
+			else
+				add_path(partially_grouped_rel, (Path *)
+						 create_group_path(root,
+										   partially_grouped_rel,
+										   path,
+										   parse->groupClause,
+										   NIL,
+										   dNumPartialGroups));
+		}
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6842,6 +7056,52 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   dNumPartialPartialGroups));
 			}
 		}
+
+		/* consider incremental sort */
+		foreach(lc, input_rel->partial_pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
+		}
 	}
 
 	if (can_hash && cheapest_total_path != NULL)
@@ -6938,6 +7198,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
@@ -6967,6 +7228,44 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/* also consider incremental sort on all partial paths */
+	foreach (lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
+
 }
 
 /*
-- 
2.20.1

#144James Coleman
jtc331@gmail.com
In reply to: James Coleman (#140)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Note: As I was writing this, I saw a new email come in from Tomas with
a new patch series, and some similar observations. I'll look at that
patch series more, but I think it's likely far more complete, so I will
end up going with that. I wanted to send this email anyway to at least
capture the debugging process for reference.

On Mon, Jul 8, 2019 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers group aggregate with a sorted
input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
actually can do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood you as trying to figure out a way to cause
the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

I think those are essentially the right places to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

To try to understand this better I looked through all usages of
generate_gather_paths(). The usage in gather_grouping_paths() also
treats explicit sorting as its responsibility rather than the
responsibility of generate_gather_paths(). Since I consider
incremental sort to be just another form of explicit sorting, I think
it's reasonable to make it the responsibility of callers, given that
currently generate_gather_paths() currently seems to be explicitly
only about fully presorted paths.

Of course that might not be ideal as more and more callers have to
handle it specially. So maybe it's worth shifting responsibility.

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?

I don't know enough yet to answer, but I'd like to look at (in the
debugger) the subpaths considered in each function to try to get a
better understanding of why we don't try to explicitly sort the aggs
(which we know we can't sort yet) but do for incremental sort. I
assume that means a subpath has to be present in one but not the other
since they both use the same pathkey checking function.

I stepped through one of the failing test cases and found that in
create_ordered_paths() the if expression fails because
input_rel->partial_pathlist == NIL. So apparently that is the guard
against incorrectly adding as-yet unsortable paths below the
aggregate. So if I move the create_incremental_sort_path() call inside
this block the error goes away.
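
To make the guard concrete, the relevant shape of create_ordered_paths()
is roughly this (a simplified sketch, not the actual hunk; see the
attached patch for the real change):

    if (input_rel->consider_parallel && target_parallel_safe &&
        input_rel->partial_pathlist != NIL)     /* <-- the guard */
    {
        /*
         * Apparently the non-empty partial pathlist is what rules out the
         * problematic case of sorting below the aggregate by values that
         * haven't been computed yet, so only inside this block is it safe
         * to build a Sort or IncrementalSort on top of a partial path.
         */
        Path       *cheapest_partial_path = linitial(input_rel->partial_pathlist);

        ...
    }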

Attached is a patch revision with that change.

James Coleman

Attachments:

parallel-incremental-sort-v3.patchtext/x-patch; charset=US-ASCII; name=parallel-incremental-sort-v3.patchDownload
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7f820e7351..c6aa17ba67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1875,16 +1875,8 @@ cost_incremental_sort(Path *path,
 				   limit_tuples);
 
 	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
-	if (limit_tuples > 0 && limit_tuples < input_tuples)
-	{
-		output_tuples = limit_tuples;
-		output_groups = floor(output_tuples / group_tuples) + 1;
-	}
-	else
-	{
-		output_tuples = input_tuples;
-		output_groups = input_groups;
-	}
+	output_tuples = input_tuples;
+	output_groups = input_groups;
 
 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 16996b1bc2..cb709b8764 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5033,15 +5033,20 @@ create_ordered_paths(PlannerInfo *root,
 		input_rel->partial_pathlist != NIL)
 	{
 		Path	   *cheapest_partial_path;
+		bool		is_sorted;
+		int			presorted_keys;
 
 		cheapest_partial_path = linitial(input_rel->partial_pathlist);
 
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 cheapest_partial_path->pathkeys,
+												 &presorted_keys);
+
 		/*
 		 * If cheapest partial path doesn't need a sort, this is redundant
 		 * with what's already been tried.
 		 */
-		if (!pathkeys_contained_in(root->sort_pathkeys,
-								   cheapest_partial_path->pathkeys))
+		if (!is_sorted)
 		{
 			Path	   *path;
 			double		total_groups;
@@ -5067,6 +5072,34 @@ create_ordered_paths(PlannerInfo *root,
 												path, target);
 
 			add_path(ordered_rel, path);
+
+			/*
+			 * If already partially sorted then we should also consider
+			 * incremental sort.
+			 */
+			if (presorted_keys > 0)
+			{
+				path = (Path *) create_incremental_sort_path(root,
+															 ordered_rel,
+															 cheapest_partial_path,
+															 root->sort_pathkeys,
+															 presorted_keys,
+															 limit_tuples);
+
+				path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 path,
+											 path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (path->pathtarget != target)
+					path = apply_projection_to_path(root, ordered_rel,
+													path, target);
+
+				add_path(ordered_rel, path);
+			}
 		}
 	}
 
@@ -6939,14 +6972,20 @@ static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
 	Path	   *cheapest_partial_path;
+	bool		is_sorted;
+	int			presorted_keys;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
 	generate_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
-	if (!pathkeys_contained_in(root->group_pathkeys,
-							   cheapest_partial_path->pathkeys))
+
+	is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+											 cheapest_partial_path->pathkeys,
+											 &presorted_keys);
+
+	if (!is_sorted)
 	{
 		Path	   *path;
 		double		total_groups;
@@ -6966,6 +7005,31 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 									 &total_groups);
 
 		add_path(rel, path);
+
+		/*
+		 * If already partially sorted then we should also consider
+		 * incremental sort.
+		 */
+		if (presorted_keys > 0)
+		{
+			path = (Path *) create_incremental_sort_path(root,
+														 rel,
+														 cheapest_partial_path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			path = (Path *)
+				create_gather_merge_path(root,
+										 rel,
+										 path,
+										 rel->reltarget,
+										 root->group_pathkeys,
+										 NULL,
+										 &total_groups);
+
+			add_path(rel, path);
+		}
 	}
 }
 
#145Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#143)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jul 09, 2019 at 03:37:03AM +0200, Tomas Vondra wrote:

...

Notice that the cost of the second plan is almost double that of the first. That
means 0004 does not even generate the first plan, i.e. there are cases
where we don't try to add the explicit sort before passing the path to
generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I've looked into this again, and yes - that's the reason. I've added
generate_useful_gather_paths() which is a wrapper on top of
generate_gather_paths(). It essentially does what the 0001 patch did directly
in generate_gather_paths(), so it's more like create_grouping_paths().

And that does the trick - we now generate the cheaper paths, and I don't
see any crashes in regression tests etc.
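
In outline, the new function does this (a condensed skeleton of the
attached 0001 - see the patch for the full logic):

    void
    generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
                                 bool override_rows)
    {
        List    *useful_pathkeys_list;

        if (rel->partial_pathlist == NIL)
            return;

        /* build the plain Gather / Gather Merge paths, as before */
        generate_gather_paths(root, rel, override_rows);

        /* deduce orderings that nodes above the gather might find useful */
        useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);

        /*
         * For each useful ordering and each partial path: if the path is
         * already sorted, add a plain Gather Merge; if it's the cheapest
         * partial path, also try full sort + Gather Merge; and if it has
         * a sorted prefix, try incremental sort + Gather Merge.
         */
        ...
    }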

I still suspect we already have code somewhere doing similar checks on
whether pathkeys might be useful. I've looked into pathkeys.c and
elsewhere, but no luck.

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

2) 0002 and 0003 are fixes I mentioned before

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

I won't have time to hack on this over the next ~2 weeks, but I'll try to
respond to questions when possible.

FWIW tweaking all the create_sort_path() places to also consider adding
incremental sort is a bit tedious and invasive, and it almost doubles
the amount of repetitive code. It's OK for an experiment like this, but we
should try handling this in a nicer way (move to a separate function
that does both, or something like that).

This definitely needs more work. We need to refactor it in some way, e.g.
have a function that would consider both explicit sort (on the cheapest
path) and incremental sort (on all paths), and call it from all those
places. Haven't tried it, though.
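
Just to illustrate the shape I have in mind, such a helper might look
something like this (a hypothetical sketch - the name and signature are
invented, and it's untested):

    static void
    consider_sort_and_incremental_sort(PlannerInfo *root, RelOptInfo *rel,
                                       RelOptInfo *input_rel, List *pathkeys,
                                       double limit_tuples)
    {
        ListCell   *lc;
        Path       *cheapest = input_rel->cheapest_total_path;

        foreach(lc, input_rel->pathlist)
        {
            Path       *path = (Path *) lfirst(lc);
            bool        is_sorted;
            int         presorted_keys;

            is_sorted = pathkeys_common_contained_in(pathkeys, path->pathkeys,
                                                     &presorted_keys);

            /* already sorted as required, nothing to add */
            if (is_sorted)
                continue;

            /* explicit sort is only worth trying on the cheapest input */
            if (path == cheapest)
                add_path(rel, (Path *) create_sort_path(root, rel, path,
                                                        pathkeys, limit_tuples));

            /* incremental sort on any path with a useful sorted prefix */
            if (presorted_keys > 0)
                add_path(rel, (Path *)
                         create_incremental_sort_path(root, rel, path,
                                                      pathkeys, presorted_keys,
                                                      limit_tuples));
        }
    }

In reality most callers need to put an aggregate, group, or gather merge
node on top before add_path(), so the helper would probably hand the
sorted paths back to the caller rather than adding them directly - but
that's the general shape.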

There's also a couple more places where we do create_sort_path() and don't
consider incremental sort yet - window functions, distinct etc.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-fix-pathkey-processing-in-generate_gather_p-20190709.patchtext/plain; charset=us-asciiDownload
From e7e97daf447a91e090809be4f07a5eee650eb5e7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:12:45 +0200
Subject: [PATCH 1/4] fix pathkey processing in generate_gather_paths

---
 src/backend/optimizer/path/allpaths.c   | 365 +++++++++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |  10 +-
 src/backend/optimizer/plan/planner.c    | 301 ++++++++++++++++++-
 src/include/optimizer/paths.h           |   2 +
 4 files changed, 673 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3efc807164..acddbef064 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2730,6 +2730,367 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_ecs_for_relation
+ *		Determine which EquivalenceClasses might be involved in useful
+ *		orderings of this relation.
+ *
+ * This function is in some respects a mirror image of the core function
+ * pathkeys_useful_for_merging: for a regular table, we know what indexes
+ * we have and want to test whether any of them are useful.  For a foreign
+ * table, we don't know what indexes are present on the remote side but
+ * want to speculate about which ones we'd like to use if they existed.
+ *
+ * This function returns a list of potentially-useful equivalence classes,
+ * but it does not guarantee that an EquivalenceMember exists which contains
+ * Vars only from the given relation.  For example, given ft1 JOIN t1 ON
+ * ft1.x + t1.x = 0, this function will say that the equivalence class
+ * containing ft1.x + t1.x is potentially useful.  Supposing ft1 is remote and
+ * t1 is local (or on a different server), it will turn out that no useful
+ * ORDER BY clause can be generated.  It's not our job to figure that out
+ * here; we're only interested in identifying relevant ECs.
+ */
+static List *
+get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_eclass_list = NIL;
+	ListCell   *lc;
+	Relids		relids;
+
+	/*
+	 * First, consider whether any active EC is potentially useful for a merge
+	 * join against this relation.
+	 */
+	if (rel->has_eclass_joins)
+	{
+		foreach(lc, root->eq_classes)
+		{
+			EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc);
+
+			if (eclass_useful_for_merging(root, cur_ec, rel))
+				useful_eclass_list = lappend(useful_eclass_list, cur_ec);
+		}
+	}
+
+	/*
+	 * Next, consider whether there are any non-EC derivable join clauses that
+	 * are merge-joinable.  If the joininfo list is empty, we can exit
+	 * quickly.
+	 */
+	if (rel->joininfo == NIL)
+		return useful_eclass_list;
+
+	/* If this is a child rel, we must use the topmost parent rel to search. */
+	if (IS_OTHER_REL(rel))
+	{
+		Assert(!bms_is_empty(rel->top_parent_relids));
+		relids = rel->top_parent_relids;
+	}
+	else
+		relids = rel->relids;
+
+	/* Check each join clause in turn. */
+	foreach(lc, rel->joininfo)
+	{
+		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Consider only mergejoinable clauses */
+		if (restrictinfo->mergeopfamilies == NIL)
+			continue;
+
+		/* Make sure we've got canonical ECs. */
+		update_mergeclause_eclasses(root, restrictinfo);
+
+		/*
+		 * restrictinfo->mergeopfamilies != NIL is sufficient to guarantee
+		 * that left_ec and right_ec will be initialized, per comments in
+		 * distribute_qual_to_rels.
+		 *
+		 * We want to identify which side of this merge-joinable clause
+		 * contains columns from the relation produced by this RelOptInfo. We
+		 * test for overlap, not containment, because there could be extra
+		 * relations on either side.  For example, suppose we've got something
+		 * like ((A JOIN B ON A.x = B.x) JOIN C ON A.y = C.y) LEFT JOIN D ON
+		 * A.y = D.y.  The input rel might be the joinrel between A and B, and
+		 * we'll consider the join clause A.y = D.y. relids contains a
+		 * relation not involved in the join class (B) and the equivalence
+		 * class for the left-hand side of the clause contains a relation not
+		 * involved in the input rel (C).  Despite the fact that we have only
+		 * overlap and not containment in either direction, A.y is potentially
+		 * useful as a sort column.
+		 *
+		 * Note that it's even possible that relids overlaps neither side of
+		 * the join clause.  For example, consider A LEFT JOIN B ON A.x = B.x
+		 * AND A.x = 1.  The clause A.x = 1 will appear in B's joininfo list,
+		 * but overlaps neither side of B.  In that case, we just skip this
+		 * join clause, since it doesn't suggest a useful sort order for this
+		 * relation.
+		 */
+		if (bms_overlap(relids, restrictinfo->right_ec->ec_relids))
+			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
+														restrictinfo->right_ec);
+		else if (bms_overlap(relids, restrictinfo->left_ec->ec_relids))
+			useful_eclass_list = list_append_unique_ptr(useful_eclass_list,
+														restrictinfo->left_ec);
+	}
+
+	return useful_eclass_list;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	List	   *useful_eclass_list;
+	EquivalenceClass *query_ec = NULL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	/* Get the list of interesting EquivalenceClasses. */
+	useful_eclass_list = get_useful_ecs_for_relation(root, rel);
+
+	/* Extract unique EC for query, if any, so we don't consider it again. */
+	if (list_length(root->query_pathkeys) == 1)
+	{
+		PathKey    *query_pathkey = linitial(root->query_pathkeys);
+
+		query_ec = query_pathkey->pk_eclass;
+	}
+
+	/*
+	 * As a heuristic, the only pathkeys we consider here are those of length
+	 * one.  It's surely possible to consider more, but since each one we
+	 * choose to consider will generate a round-trip to the remote side, we
+	 * need to be a bit cautious here.  It would sure be nice to have a local
+	 * cache of information about remote index definitions...
+	 */
+	foreach(lc, useful_eclass_list)
+	{
+		EquivalenceClass *cur_ec = lfirst(lc);
+		Expr	   *em_expr;
+		PathKey    *pathkey;
+
+		/* If redundant with what we did above, skip it. */
+		if (cur_ec == query_ec)
+			continue;
+
+		/* If no pushable expression for this rel, skip it. */
+		em_expr = find_em_expr_for_rel(cur_ec, rel);
+		if (em_expr == NULL)
+			continue;
+
+		/* Looks like we can generate a pathkey, so let's do it. */
+		pathkey = make_canonical_pathkey(root, cur_ec,
+										 linitial_oid(cur_ec->ec_opfamilies),
+										 BTLessStrategyNumber,
+										 false);
+		useful_pathkeys_list = lappend(useful_pathkeys_list,
+									   list_make1(pathkey));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just at pathkeys of the
+ * input paths (aiming to preserve the ordering). It also considers orderings
+ * that might be useful to nodes above the gather merge node, and tries to
+ * add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2902,7 +3263,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bfb52f21ab..c2877942cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -5932,7 +5932,10 @@ prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 				}
 			}
 			if (!j)
-				elog(ERROR, "could not find pathkey item to sort");
+			{
+				elog(WARNING, "could not find pathkey item to sort");
+				Assert(false);
+			}
 
 			/*
 			 * Do we need to insert a Result node?
@@ -6491,7 +6494,10 @@ make_unique_from_pathkeys(Plan *lefttree, List *pathkeys, int numCols)
 		}
 
 		if (!tle)
-			elog(ERROR, "could not find pathkey item to sort");
+		{
+			elog(WARNING, "could not find pathkey item to sort");
+			Assert(false);
+		}
 
 		/*
 		 * Look up the correct equality operator from the PathKey's slightly
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 16996b1bc2..0939f2f7b9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5068,6 +5068,48 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/* also consider incremental sorts on all partial paths */
+		{
+			ListCell *lc;
+			foreach (lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				/* already handled above */
+				if (input_path == cheapest_partial_path)
+					continue;
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys, &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys > 0)
+				{
+					/* Also consider incremental sort. */
+					sorted_path = (Path *) create_incremental_sort_path(root,
+																		ordered_rel,
+																		input_path,
+																		root->sort_pathkeys,
+																		presorted_keys,
+																		limit_tuples);
+
+					/* Add projection step if needed */
+					if (sorted_path->pathtarget != target)
+						sorted_path = apply_projection_to_path(root, ordered_rel,
+															   sorted_path, target);
+
+					add_path(ordered_rel, sorted_path);
+				}
+			}
+
+		}
 	}
 
 	/*
@@ -6484,6 +6526,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			}
 		}
 
+
+		/*
+		 * Use any available suitably-sorted path as input, with incremental
+		 * sort path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
+		}
+
 		/*
 		 * Instead of operating directly on the input relation, we can
 		 * consider finalizing a partially aggregated path.
@@ -6530,6 +6646,53 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   havingQual,
 											   dNumGroups));
 			}
+
+			/* incremental sort */
+			foreach(lc, partially_grouped_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
+			}
+
 		}
 	}
 
@@ -6798,6 +6961,57 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Use any available suitably-sorted path as input, and also consider
+		 * sorting the cheapest partial path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* also ignore already sorted paths */
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			/* add incremental sort */
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_path(partially_grouped_rel, (Path *)
+						 create_agg_path(root,
+										 partially_grouped_rel,
+										 path,
+										 partially_grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_INITIAL_SERIAL,
+										 parse->groupClause,
+										 NIL,
+										 agg_partial_costs,
+										 dNumPartialGroups));
+			else
+				add_path(partially_grouped_rel, (Path *)
+						 create_group_path(root,
+										   partially_grouped_rel,
+										   path,
+										   parse->groupClause,
+										   NIL,
+										   dNumPartialGroups));
+		}
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6842,6 +7056,52 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   dNumPartialPartialGroups));
 			}
 		}
+
+		/* consider incremental sort */
+		foreach(lc, input_rel->partial_pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
+		}
 	}
 
 	if (can_hash && cheapest_total_path != NULL)
@@ -6938,6 +7198,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
@@ -6967,6 +7228,44 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/* also consider incremental sort on all partial paths */
+	foreach (lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
+
 }
 
 /*
@@ -7222,7 +7521,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index e7a40cec3f..20fa94281b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.21.0

0002-fix-costing-in-cost_incremental_sort-20190709.patchtext/plain; charset=us-asciiDownload
From 3a7bcb8940911029ba5927e9993a93f86139f7f4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:13:04 +0200
Subject: [PATCH 2/4] fix costing in cost_incremental_sort

---
 src/backend/optimizer/path/costsize.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7f820e7351..c6aa17ba67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1875,16 +1875,8 @@ cost_incremental_sort(Path *path,
 				   limit_tuples);
 
 	/* If we have a LIMIT, adjust the number of groups we'll have to return. */
-	if (limit_tuples > 0 && limit_tuples < input_tuples)
-	{
-		output_tuples = limit_tuples;
-		output_groups = floor(output_tuples / group_tuples) + 1;
-	}
-	else
-	{
-		output_tuples = input_tuples;
-		output_groups = input_groups;
-	}
+	output_tuples = input_tuples;
+	output_groups = input_groups;
 
 	/*
 	 * Startup cost of incremental sort is the startup cost of its first group
-- 
2.21.0

0003-fix-explain-in-parallel-mode-20190709.patchtext/plain; charset=us-asciiDownload
From 6a439153321e40777d791b70424424a501485408 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 00:13:34 +0200
Subject: [PATCH 3/4] fix explain in parallel mode

---
 src/backend/commands/explain.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d3f855a12a..925e8236ba 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2775,7 +2775,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 								 fullsort_spaceUsed, fullsort_group_count);
 				if (prefixsort_instrument)
 					appendStringInfo(es->str,
-									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld",
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
 									 prefixsort_sortMethod, prefixsort_spaceType,
 									 prefixsort_spaceUsed, prefixsort_group_count);
 				else
-- 
2.21.0

0004-add-force_incremental_sort-GUC-20190709.patchtext/plain; charset=us-asciiDownload
From 9dc5530429083195e90129b3eb18f8d7a5f78451 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2019 13:36:42 +0200
Subject: [PATCH 4/4] add force_incremental_sort GUC

---
 src/backend/optimizer/path/costsize.c |  8 ++++++++
 src/backend/utils/misc/guc.c          | 10 ++++++++++
 src/include/optimizer/cost.h          |  1 +
 3 files changed, 19 insertions(+)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c6aa17ba67..ee4487b158 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -140,6 +140,8 @@ bool		enable_parallel_append = true;
 bool		enable_parallel_hash = true;
 bool		enable_partition_pruning = true;
 
+bool		force_incremental_sort = true;
+
 typedef struct
 {
 	PlannerInfo *root;
@@ -1907,6 +1909,12 @@ cost_incremental_sort(Path *path,
 	path->rows = input_tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	if (force_incremental_sort)
+	{
+		path->startup_cost = input_startup_cost;
+		path->total_cost = input_total_cost;
+	}
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f922ee66a0..d2cc5b56b5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -941,6 +941,16 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"force_incremental_sort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Makes the incremental sort look like no-cost."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&force_incremental_sort,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of incremental sort steps."),
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b9d7a77e65..bad1d5a330 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -66,6 +66,7 @@ extern PGDLLIMPORT bool enable_parallel_append;
 extern PGDLLIMPORT bool enable_parallel_hash;
 extern PGDLLIMPORT bool enable_partition_pruning;
 extern PGDLLIMPORT int constraint_exclusion;
+extern PGDLLIMPORT bool force_incremental_sort;
 
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 								  double index_pages, PlannerInfo *root);
-- 
2.21.0

#146James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#145)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:

On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

We're running query like this:

SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers group aggregate with a sorted
input. And (a) is a prefix of (a,sum(b),count(*)) so we think we
actually can do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

Ah, I misunderstood you as trying to figure out a way to cause
the same problem with a regular sort.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.

Given this comment in create_ordered_paths():

generate_gather_paths() will have already generated a simple Gather
path for the best parallel path, if any, and the loop above will have
considered sorting it. Similarly, generate_gather_paths() will also
have generated order-preserving Gather Merge plans which can be used
without sorting if they happen to match the sort_pathkeys, and the loop
above will have handled those as well. However, there's one more
possibility: it may make sense to sort the cheapest partial path
according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

I think those are essentially the right places to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding explicit
sort (using create_sort_path) and also consider adding incremental sort.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

Yep, exactly.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that; it only preserves the existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push a sort "under" it.

That's the explanation I was missing, and it makes sense (to restate:
sometimes sorting is useful even when not required for correctness of
the data returned to the user).

In fact, we already have code dealing with this "issue" for a special
case - see gather_grouping_paths(). It generates plain gather merge
paths, but then also considers building one with explicit sort. But it
only does that for grouping paths (when it's clear we need to be looking
at grouping_pathkeys), and there are generate_gather_paths() callers that
don't have similar treatment.

I just find it humorous both of us were writing separate emails
mentioning that function at the same time.

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?

I don't know enough yet to answer, but I'd like to look at (in the
debugger) the subpaths considered in each function to try to get a
better understanding of why we don't try to explicitly sort the aggs
(which we know we can't sort yet) but do for incremental sort. I
assume that means a subpath has to be present in one but not the other
since they both use the same pathkey checking function.

I've been wondering if we have some other code that needs to consider
interesting pathkeys "candidates" (instead of just getting the list
interesting in that place). Because then we could look at that code and
use it here ...

And guess what - postgres_fdw needs to do pretty much exactly that, when
building paths for remote relations. AFAIK we can't easily request all
plans from the remote node and then look at their pathkeys (like we'd do
with a local node), so instead we deduce "interesting pathkeys" and then
request best plans for those. And deducing "interesting" pathkeys is
pretty much what get_useful_pathkeys_for_relation() is about.

So I've copied this function (and two more, called from it), whacked it
a bit until it worked (shakespeare-writing chimp comes to mind) and
voila, it seems to be working. The errors you reported are gone, and the
plans seem to be reasonable.

Attached is a sequence of 4 patches:

0001-fix-pathkey-processing-in-generate_gather_paths.patch
----------------------------------------------------------
This is the fixed version of my previous patch, with the stuff stolen
from postgres_fdw.

0002-fix-costing-in-cost_incremental_sort.patch
-----------------------------------------------
This is the costing fix, I mentioned before.

0003-fix-explain-in-parallel-mode.patch
---------------------------------------
Minor bug in explain, when incremental sort ends up being in the
parallel part of the plan (missing newline on per-worker line)

0004-rework-where-incremental-sort-paths-are-created.patch
----------------------------------------------------------
This undoes the generate_gather_paths() changes from 0001, and instead
modifies a bunch of places that call create_sort_path() to also consider
incremental sorts. There are a couple remaining, but those should not be
relevant to the queries I've been looking at.

Essentially, 0002 and 0003 are bugfixes. 0001 and 0004 are the two
different approaches to building incremental sort + gather merge.

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With 0001+0002+0003 patches, I get a plan like this:

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the first. That
means 0004 does not even generate the first plan, i.e. there are cases
where we don't try to add the explicit sort before passing the path to
generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?

FWIW tweaking all the create_sort_path() places to also consider adding
incremental sort is a bit tedious and invasive, and it almost doubles
the amount of repetitive code. It's OK for an experiment like this, but we
should try handling this in a nicer way (move to a separate function
that does both, or something like that).

Agreed.

On Tue, Jul 9, 2019 at 8:11 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jul 09, 2019 at 03:37:03AM +0200, Tomas Vondra wrote:

...

Notice that the cost of the second plan is almost double that of the first. That
means 0004 does not even generate the first plan, i.e. there are cases
where we don't try to add the explicit sort before passing the path to
generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach planner to do this.

I've looked into this again, and yes - that's the reason. I've added
generate_useful_gather_paths() which is a wrapper on top of
generate_gather_paths(). It essentially does what 0001 patch did directly
in generate_gather_paths() so it's more like create_grouping_paths().

And that does the trick - we now generate the cheaper paths, and I don't
see any crashes in regression tests etc.

I still suspect we already have code somewhere doing similar checks on
whether pathkeys might be useful. I've looked into pathkeys.c and
elsewhere, but no luck.

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

2) 0002 and 0003 are fixes I mentioned before

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.

I won't have time to hack on this over the next ~2 weeks, but I'll try to
respond to questions when possible.

Understood; thanks so much for your help on this.

FWIW tweaking all the create_sort_path() places to also consider adding
incremental sort is a bit tedious and invasive, and it almost doubles
the amount of repetitive code. It's OK for an experiment like this, but we
should try handling this in a nicer way (move to a separate function
that does both, or something like that).

This definitely needs more work. We need to refactor it in some way, e.g.
have a function that would consider both explicit sort (on the cheapest
path) and incremental sort (on all paths), and call it from all those
places. Haven't tried it, though.

There's also a couple more places where we do create_sort_path() and don't
consider incremental sort yet - window functions, distinct etc.

Yep, and we likely want regression tests for all of these cases also.

James Coleman

#147James Coleman
jtc331@gmail.com
In reply to: Alexander Korotkov (#142)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 8, 2019 at 6:37 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Jul 4, 2019 at 4:25 PM James Coleman <jtc331@gmail.com> wrote:

Process questions:
- Do I need to explicitly move the patch somehow to the next CF?

We didn't manage to register it in the current (July) commitfest. So
please register it in the next (September) one.

I've moved it to the September cf.

- Since I've basically taken over patch ownership, should I move my
name from reviewer to author in the CF app? And can there be two
authors listed there?

Sure, you're a co-author of this patch. Two or more authors can be
listed in the CF app; you can find a lot of examples on the list.

Done.

James Coleman

#148Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#146)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:

...

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding an explicit
sort (using create_sort_path) and also consider adding incremental sort there.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

Yep, exactly.

If I remember correctly, one of the previous patch versions (in the early
2018 commitfests) actually modified many of those places, but it did that
in a somewhat "naive" way. It simply used incremental sort whenever the
path was partially sorted, or something like that. So it did not use
costing properly. There was an attempt to fix that in the last commitfest
but the costing model was deemed to be a bit too rough and unreliable
(especially the ndistinct estimates can be quite weak), so the agreement
was to try salvaging the patch for PG11 by only considering incremental
sort in "safest" places with greatest gains.

We've significantly improved the costing model since then, and the
implementation likely handles the corner cases much better. But that does
not mean we have to introduce the incremental sort to all those places at
once - it might be wise to split that into separate patches.

It's not just about picking the right plan - we don't really know what
impact these extra paths might have on planner performance, so maybe we
should look at that too. And the impact might be different for each of
those places.

I'll leave that up to you, but I certainly won't insist on doing it all in
one huge patch.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that, it only preserves the existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push the sort "under" it.

That's the explanation I was missing; and it makes sense (to restate:
sometimes sorting is useful even when not required for correctness of
the data returned to the user).

Yes, although even when the sorting is required for correctness (because
the user specified ORDER BY) you can do it at different points in the
plan. We'd still produce correct results, but the sort might be done at
the very end without these changes.

For example we might end up with plans

Incremental Sort (presorted: a, path keys: a,b)
-> Gather Merge (path keys: a)
-> Index Scan (path keys: a)

but with those changes we might push the incremental sort down into the
parallel part:

Gather Merge (path keys: a,b)
-> Incremental Sort (presorted: a, path keys: a,b)
-> Index Scan (path keys: a)

which is likely better. Perhaps that's what you meant, though.

In fact, we already have code dealing with this "issue" for a special
case - see gather_grouping_paths(). It generates plain gather merge
paths, but then also considers building one with an explicit sort. But it
only does that for grouping paths (when it's clear we need to be looking
at grouping_pathkeys), and there are other generate_gather_paths()
callers that don't get similar treatment.

I just find it humorous both of us were writing separate emails
mentioning that function at the same time.

;-)

...

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?

I'm not sure what generate_useful_gather_paths() should do but does
not? Or is it just the division of labor that you think is wrong? In any
case, feel free to whack it until you're happy with it.

...

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

2) 0002 and 0003 are fixes I mentioned before

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.

This GUC is really meant primarily for development, to force the choice of
incremental sort during regression tests (so as to use incremental sort in
as many plans as possible). I'd remove it from the final patch. I think
the general consensus on pgsql-hackers is that we should not introduce
GUCs unless absolutely necessary. But for development GUCs - sure.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

...

This definitely needs more work. We need to refactor it in some way, e.g.
have a function that would consider both explicit sort (on the cheapest
path) and incremental sort (on all paths), and call it from all those
places. Haven't tried it, though.

There's also a couple more places where we do create_sort_path() and don't
consider incremental sort yet - window functions, distinct etc.

Yep, and we likely want regression tests for all of these cases also.

Yes, that's definitely something a committable patch needs to include.

Another thing we should have is a collection of tests with data sets that
"break" the costing model in some way (skew, correlated columns,
non-uniform group sizes, ...). That's something not meant for commit,
because it'll probably require significant amounts of data, but we need it
to assess the quality of the planner/costing part. I know there are various
ad-hoc test cases in the thread history; it'd be good to consolidate those
into one place.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#149James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#148)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jul 9, 2019 at 10:54 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:

...

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding an explicit
sort (using create_sort_path) and also consider adding incremental sort there.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

Yep, exactly.

If I remember correctly, one of the previous patch versions (in the early
2018 commitfests) actually modified many of those places, but it did that
in a somewhat "naive" way. It simply used incremental sort whenever the
path was partially sorted, or something like that. So it did not use
costing properly. There was an attempt to fix that in the last commitfest
but the costing model was deemed to be a bit too rough and unreliable
(especially the ndistinct estimates can be quite weak), so the agreement
was to try salvaging the patch for PG11 by only considering incremental
sort in "safest" places with greatest gains.

We've significantly improved the costing model since then, and the
implementation likely handles the corner cases much better. But that does
not mean we have to introduce the incremental sort to all those places at
once - it might be wise to split that into separate patches.

Yes, although we haven't added the MCV checking yet; that's on my
mental checklist, but it's another new area of the codebase for me to
understand, so I've been prioritizing other parts of the patch.

It's not just about picking the right plan - we don't really know what
impact these extra paths might have on planner performance, so maybe we
should look at that too. And the impact might be different for each of
those places.

I'll leave that up to you, but I certainly won't insist on doing it all in
one huge patch.

I'm not opposed to handling some of them separately, but I would like
to at least hit the places where it's most likely (for example, with
LIMIT) to improve things. I suppose I'll have to look at all of the
usages of create_sort_path() and try to rank them in terms of
perceived likely value.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that, it only preserves the existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push the sort "under" it.

That's the explanation I was missing; and it makes sense (to restate:
sometimes sorting is useful even when not required for correctness of
the data returned to the user).

Yes, although even when the sorting is required for correctness (because
the user specified ORDER BY) you can do it at different points in the
plan. We'd still produce correct results, but the sort might be done at
the very end without these changes.

For example we might end up with plans

Incremental Sort (presorted: a, path keys: a,b)
-> Gather Merge (path keys: a)
-> Index Scan (path keys: a)

but with those changes we might push the incremental sort down into the
parallel part:

Gather Merge (path keys: a,b)
-> Incremental Sort (presorted: a, path keys: a,b)
-> Index Scan (path keys: a)

which is likely better. Perhaps that's what you meant, though.

I was thinking of ordering being useful for grouping/aggregation or
merge joins; I didn't realize the above plan wasn't possible yet, so
that explanation is helpful.

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?

I'm not sure what generate_useful_gather_paths() should do but does
not? Or is it just the division of labor that you think is wrong? In any
case, feel free to whack it until you're happy with it.

Oh, I didn't mean to imply generate_useful_gather_paths() was
deficient in some way; I was noting that I'd noticed that
apply_scanjoin_target_to_paths() didn't consider explicit sort +
gather merge on master, and that maybe that was an unintentional miss
and something that should be remedied.

And if it is a miss, then that would demonstrate in my mind that the
addition of generate_useful_gather_paths() would be a helpful refactor
as a standalone patch even without incremental sort. So "currently
wrong" above meant "on master" not "in the current version of the
patch."

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

I've been reading this several times + stepping through with the
debugger to understand when this is useful, but I have a few
questions.

The first case considered in get_useful_pathkeys_for_relation (which
considers root->query_pathkeys) makes a lot of sense -- obviously if
we want sorted results then it's useful to consider both full and
incremental sort.

But I'm not sure I see yet a way to trigger the second case (which
uses get_useful_ecs_for_relation to build pathkeys potentially useful
for merge joins). In the FDW case we need to consider it since we want
to avoid local sort and so want to see if the foreign server might be
able to provide us useful presorted data, but in the local case I
don't think that's useful. From what I can tell merge join path
costing internally considers possible sorts of both the inner and
outer input paths, and merge join plan node creation is responsible
for building explicit sort nodes as necessary (i.e., there are no
explicit sort paths created for merge join paths.) That means that,
for example, a query like:

select * from
(select * from j2 order by j2.t, j2.a) j2
join (select * from j1 order by j1.t) j1
on j1.t = j2.t and j1.a = j2.a;

doesn't consider incremental sort for the merge join path (I disabled
hash joins, nested loops, and full sorts, testing on an empty
table just to easily force a merge join plan). And unfortunately I
don't think there's an easy way to remedy that: from what I can tell
it'd be a pretty invasive patch requiring refactoring merge join
costing to consider both kinds of sorting (in the simplest
implementation that would mean considering up to 4x as many merge join
paths -- inner/outer sorted by: full/full, full/incremental,
incremental/full, and incremental/incremental). Given that's a
significant undertaking on its own, I think I'm going to avoid
addressing it as part of this patch.

If it's true that the get_useful_ecs_for_relation part of that logic
isn't actually exercisable currently, then that would cut down
significantly on the amount of code that needs to be added for
consideration of valid gather merge paths. But if you can think of a
counterexample, please let me know.

2) 0002 and 0003 are fixes I mentioned before

I'm incorporating those with a bit of additional cleanup.

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.

This GUC is really meant primarily for development, to force the choice of
incremental sort during regression tests (so as to use incremental sort in
as many plans as possible). I'd remove it from the final patch. I think
the general consensus on pgsql-hackers is that we should not introduce
GUCs unless absolutely necessary. But for development GUCs - sure.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

That would cover the usage case I was getting at. Having enable_sort
disable incremental sort also came in without much explanation [1]:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:

Also some other changes from me:
...
enable_sort should act as a cost-based soft disable for both incremental
and normal sort.

I wasn't sure that fully made sense to me, but was assuming the idea
was to effectively not introduce a regression for anyone already
disabling sort to force a specific plan shape. That being said, any
new execution node/planning feature can cause a regression in such
"hinted" queries, so I'm not sure that's a good reason on its own. In
any case, incremental sort is different enough in performance
qualities that I think you'd want to re-evaluate possible plans in
queries where enable_sort=off is useful, so I'm going to make incremental
sort independent of enable_sort unless there's a strong objection
here.

Tangentially: I'd almost expect enable_incremental_sort to act as a
hard disable (and not even generate the paths) rather than a soft
cost-based disable, since while standard sort is the most basic
operation that needs to always be available as a last resort the same
is not true for incremental sort...

Another thing we should have is a collection of tests with data sets that
"break" the costing model in some way (skew, correlated columns,
non-uniform group sizes, ...). That's something not meant for commit,
because it'll probably require significant amounts of data, but we need it
to assess the quality of the planner/costing part. I know there are various
ad-hoc test cases in the thread history; it'd be good to consolidate those
into one place.

Agreed.

I'm continuing to work on the planning side of this with the goal of
not needing to modify too many places to consider an incremental sort
path + considering which ones are most likely to be useful, but I
wanted to get my questions about get_useful_ecs_for_relation sent out
while I work on that.

If we end up wanting to limit the number of places we consider
incremental sort (whether for planning time or merely for size of the
initial patch), do you have any thoughts on what general areas we
should consider most valuable? Besides the obvious LIMIT case, another
area that might benefit is min/max, though I'm not sure yet whether
that would really end up meaning considering it all over the place.

James Coleman

[1]: /messages/by-id/CAPpHfdtKHETXhf062CPvkjpG1wnjQ7rv4uLhZgYQ6VZjwqDYpg@mail.gmail.com

#150Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#149)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 14, 2019 at 02:38:40PM -0400, James Coleman wrote:

On Tue, Jul 9, 2019 at 10:54 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:

...

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding an explicit
sort (using create_sort_path) and also consider adding incremental sort there.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

Yep, exactly.

If I remember correctly, one of the previous patch versions (in the early
2018 commitfests) actually modified many of those places, but it did that
in a somewhat "naive" way. It simply used incremental sort whenever the
path was partially sorted, or something like that. So it did not use
costing properly. There was an attempt to fix that in the last commitfest
but the costing model was deemed to be a bit too rough and unreliable
(especially the ndistinct estimates can be quite weak), so the agreement
was to try salvaging the patch for PG11 by only considering incremental
sort in "safest" places with greatest gains.

We've significantly improved the costing model since then, and the
implementation likely handles the corner cases much better. But that does
not mean we have to introduce the incremental sort to all those places at
once - it might be wise to split that into separate patches.

Yes, although we haven't added the MCV checking yet; that's on my
mental checklist, but it's another new area of the codebase for me to
understand, so I've been prioritizing other parts of the patch.

Sure, no problem.

It's not just about picking the right plan - we don't really know what
impact these extra paths might have on planner performance, so maybe we
should look at that too. And the impact might be different for each of
those places.

I'll leave that up to you, but I certainly won't insist on doing it all in
one huge patch.

I'm not opposed to handling some of them separately, but I would like
to at least hit the places where it's most likely (for example, with
LIMIT) to improve things. I suppose I'll have to look at all of the
usages of create_sort_path() and try to rank them in terms of
perceived likely value.

Yep, makes sense.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that, it only preserves the existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push the sort "under" it.

That's the explanation I was missing; and it makes sense (to restate:
sometimes sorting is useful even when not required for correctness of
the data returned to the user).

Yes, although even when the sorting is required for correctness (because
the user specified ORDER BY) you can do it at different points in the
plan. We'd still produce correct results, but the sort might be done at
the very end without these changes.

For example we might end up with plans

Incremental Sort (presorted: a, path keys: a,b)
-> Gather Merge (path keys: a)
-> Index Scan (path keys: a)

but with those changes we might push the incremental sort down into the
parallel part:

Gather Merge (path keys: a,b)
-> Incremental Sort (presorted: a, path keys: a,b)
-> Index Scan (path keys: a)

which is likely better. Perhaps that's what you meant, though.

I was thinking of ordering being useful for grouping/aggregation or
merge joins; I didn't realize the above plan wasn't possible yet, so
that explanation is helpful.

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?

I'm not sure what generate_useful_gather_paths() should do but does
not? Or is it just the division of labor that you think is wrong? In any
case, feel free to whack it until you're happy with it.

Oh, I didn't mean to imply generate_useful_gather_paths() was
deficient in some way; I was noting that I'd noticed that
apply_scanjoin_target_to_paths() didn't consider explicit sort +
gather merge on master, and that maybe that was an unintentional miss
and something that should be remedied.

And if it is a miss, then that would demonstrate in my mind that the
addition of generate_useful_gather_paths() would be a helpful refactor
as a standalone patch even without incremental sort. So "currently
wrong" above meant "on master" not "in the current version of the
patch."

Ah, OK. I don't know if it's a miss or an intentional omission - it might
be a mix of both, actually. I wouldn't be surprised if doing that was
considered and deemed not worth it, but if we can come up with a
plan where adding an explicit sort here helps ...

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

I've been reading this several times + stepping through with the
debugger to understand when this is useful, but I have a few
questions.

The first case considered in get_useful_pathkeys_for_relation (which
considers root->query_pathkeys) makes a lot of sense -- obviously if
we want sorted results then it's useful to consider both full and
incremental sort.

But I'm not sure I see yet a way to trigger the second case (which
uses get_useful_ecs_for_relation to build pathkeys potentially useful
for merge joins). In the FDW case we need to consider it since we want
to avoid local sort and so want to see if the foreign server might be
able to provide us useful presorted data, but in the local case I
don't think that's useful. From what I can tell merge join path
costing internally considers possible sorts of both the inner and
outer input paths, and merge join plan node creation is responsible
for building explicit sort nodes as necessary (i.e., there are no
explicit sort paths created for merge join paths.) That means that,
for example, a query like:

select * from
(select * from j2 order by j2.t, j2.a) j2
join (select * from j1 order by j1.t) j1
on j1.t = j2.t and j1.a = j2.a;

doesn't consider incremental sort for the merge join path (I disabled
hash joins, nested loops, and full sorts, testing on an empty
table just to easily force a merge join plan). And unfortunately I
don't think there's an easy way to remedy that: from what I can tell
it'd be a pretty invasive patch requiring refactoring merge join
costing to consider both kinds of sorting (in the simplest
implementation that would mean considering up to 4x as many merge join
paths -- inner/outer sorted by: full/full, full/incremental,
incremental/full, and incremental/incremental). Given that's a
significant undertaking on its own, I think I'm going to avoid
addressing it as part of this patch.

If it's true that the get_useful_ecs_for_relation part of that logic
isn't actually exercisable currently, then that would cut down
significantly on the amount of code that needs to be added for
consideration of valid gather merge paths. But if you can think of a
counterexample, please let me know.

It's quite possible parts of the code are not needed - I've pretty much
just copied the code and made minimal changes to get it working.

That being said, it's not clear to me why plans like this would not be
useful (or why it would require changes to merge join costing):

Merge Join
-> Gather Merge
-> Incremental Sort
-> Gather Merge
-> Incremental Sort

But I have not checked the code, so maybe it would, in which case I
think it's OK to skip it in this patch (or at least in v1 of it).

2) 0002 and 0003 are fixes I mentioned before

I'm incorporating those with a bit of additional cleanup.

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.

This GUC is really meant primarily for development, to force the choice of
incremental sort during regression tests (so as to use incremental sort in
as many plans as possible). I'd remove it from the final patch. I think
the general consensus on pgsql-hackers is that we should not introduce
GUCs unless absolutely necessary. But for development GUCs - sure.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

That would cover the usage case I was getting at. Having enable_sort
disable incremental sort also came in without much explanation [1]:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:

Also some other changes from me:
...
enable_sort should act as a cost-based soft disable for both incremental
and normal sort.

I wasn't sure that fully made sense to me, but was assuming the idea
was to effectively not introduce a regression for anyone already
disabling sort to force a specific plan shape. That being said, any
new execution node/planning feature can cause a regression in such
"hinted" queries, so I'm not sure that's a good reason on its own. In
any case, incremental sort is different enough in performance
qualities that I think you'd want to re-evaluate possible plans in
queries where enable_sort=off is useful, so I'm going to make incremental
sort independent of enable_sort unless there's a strong objection
here.

OK

Tangentially: I'd almost expect enable_incremental_sort to act as a
hard disable (and not even generate the paths) rather than a soft
cost-based disable, since while standard sort is the most basic
operation that needs to always be available as a last resort the same
is not true for incremental sort...

Good point. It's somewhat similar to options like enable_parallel_hash
which also are "hard" switches (i.e. not cost-based penalties).

Another thing we should have is a collection of tests with data sets that
"break" the costing model in some way (skew, correlated columns,
non-uniform group sizes, ...). That's something not meant for commit,
because it'll probably require significant amounts of data, but we need it
to assess the quality of the planner/costing part. I know there are various
ad-hoc test cases in the thread history; it'd be good to consolidate those
into one place.

Agreed.

I'm continuing to work on the planning side of this with the goal of
not needing to modify too many places to consider an incremental sort
path + considering which ones are most likely to be useful, but I
wanted to get my questions about get_useful_ecs_for_relation sent out
while I work on that.

If we end up wanting to limit the number of places we consider
incremental sort (whether for planning time or merely for size of the
initial patch), do you have any thoughts on what general areas we
should consider most valuable? Besides the obvious LIMIT case, another
area that might benefit is min/max, though I'm not sure yet whether
that would really end up meaning considering it all over the place.

OK, sounds like a plan!

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#151James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#150)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 14, 2019 at 10:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

I've been reading this several times + stepping through with the
debugger to understand when this is useful, but I have a few
questions.

The first case considered in get_useful_pathkeys_for_relation (which
considers root->query_pathkeys) makes a lot of sense -- obviously if
we want sorted results then it's useful to consider both full and
incremental sort.

But I'm not sure I see yet a way to trigger the second case (which
uses get_useful_ecs_for_relation to build pathkeys potentially useful
for merge joins). In the FDW case we need to consider it since we want
to avoid local sort and so want to see if the foreign server might be
able to provide us useful presorted data, but in the local case I
don't think that's useful. From what I can tell merge join path
costing internally considers possible sorts of both the inner and
outer input paths, and merge join plan node creation is responsible
for building explicit sort nodes as necessary (i.e., there are no
explicit sort paths created for merge join paths.) That means that,
for example, a query like:

select * from
(select * from j2 order by j2.t, j2.a) j2
join (select * from j1 order by j1.t) j1
on j1.t = j2.t and j1.a = j2.a;

doesn't consider incremental sort for the merge join path (I disabled
hash joins, nested loops, and full sorts, testing on an empty
table just to easily force a merge join plan). And unfortunately I
don't think there's an easy way to remedy that: from what I can tell
it'd be a pretty invasive patch requiring refactoring merge join
costing to consider both kinds of sorting (in the simplest
implementation that would mean considering up to 4x as many merge join
paths -- inner/outer sorted by: full/full, full/incremental,
incremental/full, and incremental/incremental). Given that's a
significant undertaking on its own, I think I'm going to avoid
addressing it as part of this patch.

If it's true that the get_useful_ecs_for_relation part of that logic
isn't actually exercisable currently, then that would cut down
significantly on the amount of code that needs to be added for
consideration of valid gather merge paths. But if you can think of a
counterexample, please let me know.

It's quite possible parts of the code are not needed - I've pretty much
just copied the code and made minimal changes to get it working.

That being said, it's not clear to me why plans like this would not be
useful (or why it would require changes to merge join costing):

Merge Join
-> Gather Merge
-> Incremental Sort
-> Gather Merge
-> Incremental Sort

But I have not checked the code, so maybe it would, in which case I
think it's OK to skip it in this patch (or at least in v1 of it).

Someone could correct me if I'm wrong, but I've noticed some comments
that seem to imply we avoid plans with multiple parallel gather
[merge]s; i.e., we try to put a single parallel node as high as
possible in the plan. I assume that's to avoid multiplying out the
number of workers we might consume. And in the sample query above that
kind of plan never got considered because there were no partial paths
to loop through (I'm not sure I fully understand why) when the new
code is called from under apply_scanjoin_target_to_paths().

Of course, we should consider non-parallel incremental sort inputs to
merge joins...but as I noted that's a lot of extra work...

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

That would cover the usage case I was getting at. Having enable_sort
disable incremental sort also came in without much explanation [1]:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:

Also some other changes from me:
...
enable_sort should act as a cost-based soft disable for both incremental
and normal sort.

I wasn't sure that fully made sense to me, but was assuming the idea
was to effectively not introduce a regression for anyone already
disabling sort to force a specific plan shape. That being said, any
new execution node/planning feature can cause a regression in such
"hinted" queries, so I'm not sure that's a good reason on its own. In
any case, incremental sort is different enough in performance
qualities that I think you'd want to re-evaluate possible plans in
queries where enable_sort=off is useful, so I'm going to make incremental
sort independent of enable_sort unless there's a strong objection
here.

OK

Tangentially: I'd almost expect enable_incremental_sort to act as a
hard disable (and not even generate the paths) rather than a soft
cost-based disable, since while standard sort is the most basic
operation that needs to always be available as a last resort the same
is not true for incremental sort...

Good point. It's somewhat similar to options like enable_parallel_hash
which also are "hard" switches (i.e. not cost-based penalties).

I assume the proper approach here then is to check the GUC before
building and adding the path?

If we end up wanting to limit the number of places we consider
incremental sort (whether for planning time or merely for size of the
initial patch), do you have any thoughts on what general areas we
should consider most valuable? Besides the obvious LIMIT case, another
area that might benefit is min/max, though I'm not sure yet whether
that would really end up meaning considering it all over the place.

OK, sounds like a plan!

Did you have any thoughts on the question about where this is likely
to be most valuable?

James Coleman

#152Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#151)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 15, 2019 at 09:25:32AM -0400, James Coleman wrote:

On Sun, Jul 14, 2019 at 10:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

I've been reading this several times + stepping through with the
debugger to understand when this is useful, but I have a few
questions.

The first case considered in get_useful_pathkeys_for_relation (which
considers root->query_pathkeys) makes a lot of sense -- obviously if
we want sorted results then it's useful to consider both full and
incremental sort.

But I'm not sure I see yet a way to trigger the second case (which
uses get_useful_ecs_for_relation to build pathkeys potentially useful
for merge joins). In the FDW case we need to consider it since we want
to avoid local sort and so want to see if the foreign server might be
able to provide us useful presorted data, but in the local case I
don't think that's useful. From what I can tell merge join path
costing internally considers possible sorts of both the inner and
outer input paths, and merge join plan node creation is responsible
for building explicit sort nodes as necessary (i.e., there are no
explicit sort paths created for merge join paths.) That means that,
for example, a query like:

select * from
(select * from j2 order by j2.t, j2.a) j2
join (select * from j1 order by j1.t) j1
on j1.t = j2.t and j1.a = j2.a;

doesn't consider incremental sort for the merge join path (I disabled
hash joins, nested loops, and full sorts, testing on an empty
table just to easily force a merge join plan). And unfortunately I
don't think there's an easy way to remedy that: from what I can tell
it'd be a pretty invasive patch requiring refactoring merge join
costing to consider both kinds of sorting (in the simplest
implementation that would mean considering up to 4x as many merge join
paths -- inner/outer sorted by: full/full, full/incremental,
incremental/full, and incremental/incremental). Given that's a
significant undertaking on its own, I think I'm going to avoid
addressing it as part of this patch.

If it's true that the get_useful_ecs_for_relation part of that logic
isn't actually exercisable currently, then that would cut down
significantly on the amount of code that needs to be added for
consideration of valid gather merge paths. But if you can think of a
counterexample, please let me know.

It's quite possible parts of the code are not needed - I've pretty much
just copied the code and made minimal changes to get it working.

That being said, it's not clear to me why plans like this would not be
useful (or why it would require changes to merge join costing):

Merge Join
-> Gather Merge
-> Incremental Sort
-> Gather Merge
-> Incremental Sort

But I have not checked the code, so maybe it would, in which case I
think it's OK to skip it in this patch (or at least in v1 of it).

Someone could correct me if I'm wrong, but I've noticed some comments
that seem to imply we avoid plans with multiple parallel gather
[merge]s; i.e., we try to put a single parallel node as high as
possible in the plan. I assume that's to avoid multiplying out the
number of workers we might consume. And in the sample query above that
kind of plan never got considered because there were no partial paths
to loop through (I'm not sure I fully understand why) when the new
code is called from under apply_scanjoin_target_to_paths().

I think you're probably right that in this case it'd be more efficient to
just do one Gather Merge on top of the merge join. And yes - we try to put
the Gather node as high as possible, to parallelize as large a part of the
query as possible. Keeping the number of necessary workers low is one
reason, but it's also about the serial part in Amdahl's law.

Of course, we should consider non-parallel incremental sort inputs to
merge joins...but as I noted that's a lot of extra work...

OK, understood. I agree such plans would be useful, but if you think it'd
be a lot of extra work, then we can leave it out for now.

Although, looking at the mergejoin path construction code, it does not
seem very complex. Essentially, generate_mergejoin_paths() would need to
consider adding an incremental sort on the inner/outer path. It would need
some refactoring, but it's not clear to me why this would need changes to
costing - we'd simply produce multiple mergejoin paths that would be
costed by the current code.

That being said, I'm perfectly fine with ignoring this for now. There are
more fundamental bits that we need to tackle first.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

That would cover the usage case I was getting at. Having enable_sort
disable incremental sort also came in without much explanation [1]:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:

Also some other changes from me:
...
enable_sort should act as a cost-based soft disable for both incremental
and normal sort.

I wasn't sure that fully made sense to me, but was assuming the idea
was to effectively not introduce a regression for anyone already
disabling sort to force a specific plan shape. That being said, any
new execution node/planning feature can cause a regression in such
"hinted" queries, so I'm not sure that's a good reason on its own. In
any case, incremental sort is different enough in performance
qualities that I think you'd want to re-evaluate possible plans in
queries where enable_sort=off is useful, so I'm going to make incremental
sort independent of enable_sort unless there's a strong objection
here.

OK

Tangentially: I'd almost expect enable_incremental_sort to act as a
hard disable (and not even generate the paths) rather than a soft
cost-based disable, since while standard sort is the most basic
operation that needs to always be available as a last resort the same
is not true for incremental sort...

Good point. It's somewhat similar to options like enable_parallel_hash
which also are "hard" switches (i.e. not cost-based penalties).

I assume the proper approach here then is to check the GUC before
building and adding the path?

Yeah. The simplest way to do that might be setting the number of presorted
keys to 0 in pathkeys_common_contained_in(). That effectively says it
makes no sense to do incremental sort. It's a bit too deep, though - I'm
not sure it wouldn't affect any other plans.

Or you might reference the GUC in every place that considers incremental
sort, but that's going to be a lot of places.

If we end up wanting to limit the number of places we consider
incremental sort (whether for planning time or merely for size of the
initial patch), do you have any thoughts on what general areas we
should consider most valuable? Besides the obvious LIMIT case, another
area that might benefit is min/max, though I'm not sure yet whether
that would really end up meaning considering it all over the place.

OK, sounds like a plan!

Did you have any thoughts on the question about where this is likely
to be most valuable?

Good question. I can certainly list some generic cases where I'd expect
incremental sort to be most beneficial:

1) (ORDER BY + LIMIT) - in this case the main advantage is low startup
cost (compared to explicit sort, which has to fetch/sort everything)

2) ORDER BY with an on-disk sort - in this case the main advantage is the
ability to sort data in small chunks that fit in memory, instead of
flushing large amounts of data into temporary files

Of course, (2) helps with any operation that can leverage the ordering,
e.g. aggregation/...

But the question is how to map this to places in the source code. I don't
have a very good answer to that :-(

IMO the best thing we can do is get some realistic queries, and address
them first. And then eventually add incremental sort to some other places.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#153James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#143)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With the 0001+0002+0003 patches, I get a plan like this:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=10375.39..10594.72 rows=1 width=16)
-> Incremental Sort (cost=10375.39..2203675.71 rows=10000 width=16)
Sort Key: a, b, (sum(c))
Presorted Key: a, b
-> GroupAggregate (cost=10156.07..2203225.71 rows=10000 width=16)
Group Key: a, b
-> Gather Merge (cost=10156.07..2128124.39 rows=10000175 width=12)
Workers Planned: 2
-> Incremental Sort (cost=9156.04..972856.05 rows=4166740 width=12)
Sort Key: a, b
Presorted Key: a
-> Parallel Index Scan using t_a_idx on t (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=20443.84..20665.32 rows=1 width=16)
-> Incremental Sort (cost=20443.84..2235250.05 rows=10000 width=16)
Sort Key: a, b, (sum(c))
Presorted Key: a, b
-> GroupAggregate (cost=20222.37..2234800.05 rows=10000 width=16)
Group Key: a, b
-> Incremental Sort (cost=20222.37..2159698.74 rows=10000175 width=12)
Sort Key: a, b
Presorted Key: a
-> Index Scan using t_a_idx on t (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the first
one. That means 0004 does not even generate the first plan, i.e. there are
cases where we don't try to add the explicit sort before passing the path
to generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_incremental_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

But I found that if I set enable_sort=off (with or without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually even lower cost:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (cost=10255.98..10355.77 rows=1 width=16)
-> Incremental Sort (cost=10255.98..1008228.67 rows=10000 width=16)
Sort Key: a, b, (sum(c))
Presorted Key: a, b
-> Finalize GroupAggregate (cost=10156.20..1007778.67 rows=10000 width=16)
Group Key: a, b
-> Gather Merge (cost=10156.20..1007528.67 rows=20000 width=16)
Workers Planned: 2
-> Partial GroupAggregate (cost=9156.18..1004220.15 rows=10000 width=16)
Group Key: a, b
-> Incremental Sort (cost=9156.18..972869.60 rows=4166740 width=12)
Sort Key: a, b
Presorted Key: a
-> Parallel Index Scan using t_a_idx on t (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to me (costing aside) which plan is intuitively
preferable.

James

#154Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#153)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With 0001+0002+0003 patches, I get a plan like this:

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the
first one. That means 0004 does not even generate the first plan, i.e.
there are cases where we don't try to add the explicit sort before
passing the path to generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() but don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_incremental_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

But I found that if I set enable_sort=off (with or without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually even lower cost:

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10255.98..10355.77 rows=1 width=16)
   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  Finalize GroupAggregate  (cost=10156.20..1007778.67 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.20..1007528.67 rows=20000 width=16)
                     Workers Planned: 2
                     ->  Partial GroupAggregate  (cost=9156.18..1004220.15 rows=10000 width=16)
                           Group Key: a, b
                           ->  Incremental Sort  (cost=9156.18..972869.60 rows=4166740 width=12)
                                 Sort Key: a, b
                                 Presorted Key: a
                                 ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to me (costing aside) which plan is intuitively
preferable.

This seems like a thinko in add_partial_path() - it only looks at the
total cost of the paths, and ignores the startup cost entirely. I've
debugged it a bit, and what's happening for the partially-grouped
relation is roughly this:

1) We add a partial path with startup/total costs 696263 / 738029

2) We attempt to add the "Partial GroupAggregate" path, but it loses the
fight because its total cost (1004207) is higher than that of the first
path - which entirely misses that its startup cost is way lower.

3) However, we do use the startup cost later when computing the LIMIT
cost (because that's a linear approximation between startup and total
cost), and there we reject the first path too, because we happen to find
something cheaper (but still more expensive than what we'd have gotten
from the path rejected in 2).
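
To spell out the linear approximation in 3, here's a toy sketch (not
the actual costing code - the names are made up): fetching N of R rows
is charged the startup cost plus N/R-ths of the run cost, which is what
lets a path with a much lower startup cost win under a LIMIT despite a
higher total cost.

typedef struct
{
	double		startup_cost;	/* cost before the first row is returned */
	double		total_cost;		/* cost to run the path to completion */
	double		rows;			/* estimated number of rows */
} ToyPath;

static double
toy_limit_cost(const ToyPath *p, double limit_rows)
{
	double		run_cost = p->total_cost - p->startup_cost;

	if (limit_rows >= p->rows)
		return p->total_cost;
	return p->startup_cost + run_cost * (limit_rows / p->rows);
}

Plugging in the two paths from the debug output below (and assuming
roughly 10000 output rows), LIMIT 1 costs about 696268 for the path we
kept versus about 9256 for the one rejected in 2.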

Attached is a debug patch which makes this clear - it only prints info
about the first step of partial aggregates, because that's what matters
in this example. You should see something like this:

WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 696263.839993 total 738029.919993
WARNING: rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING: A: new path 0x2aa9710 rejected because of 0x2aa9448
WARNING: rel 0x2aa8e00 path 0x2aa9710 adding new = 0

which essentially says "path rejected because of total cost" (and you
know it's the interesting partial aggregate from the second plan). And
if you disable sort, you get this:

WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 10000696263.839994 total 10000738029.919996
WARNING: rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING: rel 0x2aa8e00 path 0x2aa9710 adding new = 1
...

So in this case we decided the path is interesting, thanks to the
increased cost of sort.

The comment for add_partial_path says this:

* Neither do we need to consider startup costs: parallelism is only
* used for plans that will be run to completion. Therefore, this
* routine is much simpler than add_path: it needs to consider only
* pathkeys and total cost.

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).
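
For reference, the semantics of compare_path_costs_fuzzily() are
roughly as follows (a simplified sketch that ignores the
consider_startup flags; the real thing is in pathnode.c):

typedef enum
{
	COSTS_EQUAL,				/* path costs are fuzzily equal */
	COSTS_BETTER1,				/* first path is cheaper than second */
	COSTS_BETTER2,				/* second path is cheaper than first */
	COSTS_DIFFERENT				/* neither path dominates on cost */
} PathCostComparison;

static PathCostComparison
fuzzy_cost_comparison(double startup1, double total1,
					  double startup2, double total2,
					  double fuzz_factor)
{
	if (total1 > total2 * fuzz_factor)
	{
		/* path 1 fuzzily worse on total cost ... */
		if (startup2 > startup1 * fuzz_factor)
			return COSTS_DIFFERENT; /* ... but better on startup cost */
		return COSTS_BETTER2;
	}
	if (total2 > total1 * fuzz_factor)
	{
		/* path 2 fuzzily worse on total cost ... */
		if (startup1 > startup2 * fuzz_factor)
			return COSTS_DIFFERENT;
		return COSTS_BETTER1;
	}
	/* total costs are fuzzily equal; break the tie on startup cost */
	if (startup1 > startup2 * fuzz_factor)
		return COSTS_BETTER2;
	if (startup2 > startup1 * fuzz_factor)
		return COSTS_BETTER1;
	return COSTS_EQUAL;
}

With that, the cheap-startup partial aggregate from the example
compares as COSTS_DIFFERENT against the cheap-total path, so
add_partial_path() keeps both instead of throwing one away.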

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

incremental-sort-cost-debug.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1363b9bd45..4fd8a77f48 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1833,9 +1833,6 @@ cost_incremental_sort(Path *path,
 
 	Assert(presorted_keys != 0);
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
 	if (!enable_incrementalsort)
 		startup_cost += disable_cost;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 704cfd032d..f920fbff30 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -762,6 +762,15 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	/* Relation should be OK for parallelism, too. */
 	Assert(parent_rel->consider_parallel);
 
+	if (IsA(new_path, AggPath))
+	{
+		AggPath *path = (AggPath *) new_path;
+		/* we only care about first step of partial aggregates here */
+		if (path->aggsplit == AGGSPLIT_INITIAL_SERIAL)
+			elog(WARNING, "rel %p adding partial agg path %p startup %f total %f", 
+				 parent_rel, path, new_path->startup_cost, new_path->total_cost);
+	}
+
 	/*
 	 * As in add_path, throw out any paths which are dominated by the new
 	 * path, but throw out the new path if some existing path dominates it.
@@ -782,7 +791,10 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 			{
 				/* New path costs more; keep it only if pathkeys are better. */
 				if (keyscmp != PATHKEYS_BETTER1)
+				{
+					elog(WARNING, "A: new path %p rejected because of %p", new_path, old_path);
 					accept_new = false;
+				}
 			}
 			else if (old_path->total_cost > new_path->total_cost
 					 * STD_FUZZ_FACTOR)
@@ -799,6 +811,7 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 			else if (keyscmp == PATHKEYS_BETTER2)
 			{
 				/* Costs are about the same, old path has better pathkeys. */
+				elog(WARNING, "B: new path %p rejected because of %p", new_path, old_path);
 				accept_new = false;
 			}
 			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
@@ -813,6 +826,7 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 				 * cheaper.
 				 */
 				accept_new = false;
+				elog(WARNING, "C: new path %p rejected because of %p", new_path, old_path);
 			}
 		}
 
@@ -841,6 +855,8 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 			break;
 	}
 
+	elog(WARNING, "rel %p path %p adding new = %d", parent_rel, new_path, accept_new);
+
 	if (accept_new)
 	{
 		/* Accept the new path: insert it at proper place */
add_partial_path-fix.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0ac73984d2..66afbd93ac 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -778,41 +807,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompable, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}

#155James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#154)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With 0001+0002+0003 patches, I get a plan like this:

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the
first one. That means 0004 does not even generate the first plan, i.e.
there are cases where we don't try to add the explicit sort before
passing the path to generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() but don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_incremental_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

As I continue this, I've added a couple of test cases (notably for
generate_useful_gather_paths() in both standard_join_search() and
apply_scanjoin_target_to_paths()). Those, plus the current WIP state
of my hacking on your patch adding generate_useful_gather_paths() is
attached as 0001-parallel-and-more-paths.patch.

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note, that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

Limit
-> Gather Merge
-> Incremental Sort
-> Parallel Index Scan

(Note: I'm forcing parallel plans here with:
set max_parallel_workers_per_gather=4;
set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

But I found that if I set enable_sort=off (with or without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually even lower cost:

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10255.98..10355.77 rows=1 width=16)
   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  Finalize GroupAggregate  (cost=10156.20..1007778.67 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.20..1007528.67 rows=20000 width=16)
                     Workers Planned: 2
                     ->  Partial GroupAggregate  (cost=9156.18..1004220.15 rows=10000 width=16)
                           Group Key: a, b
                           ->  Incremental Sort  (cost=9156.18..972869.60 rows=4166740 width=12)
                                 Sort Key: a, b
                                 Presorted Key: a
                                 ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to me (costing aside) which plan is intuitively
preferable.

This seems like a thinko in add_partial_path() - it only looks at the
total cost of the paths, and ignores the startup cost entirely. I've
debugged it a bit, and what's happening for the partially-grouped
relation is roughly this:

1) We add a partial path with startup/total costs 696263 / 738029

2) We attempt to add the "Partial GroupAggregate" path, but it loses the
fight because its total cost (1004207) is higher than that of the first
path - which entirely misses that its startup cost is way lower.

3) However, we do use the startup cost later when computing the LIMIT
cost (because that's a linear approximation between startup and total
cost), and there we reject the first path too, because we happen to find
something cheaper (but still more expensive than what we'd have gotten
from the path rejected in 2).

Attached is a debug patch which makes this clear - it only prints info
about the first step of partial aggregates, because that's what matters
in this example. You should see something like this:

WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 696263.839993 total 738029.919993
WARNING: rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING: A: new path 0x2aa9710 rejected because of 0x2aa9448
WARNING: rel 0x2aa8e00 path 0x2aa9710 adding new = 0

which essentially says "path rejected because of total cost" (and you
know it's the interesting partial aggregate from the second plan). And
if you disable sort, you get this:

WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 10000696263.839994 total 10000738029.919996
WARNING: rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING: rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING: rel 0x2aa8e00 path 0x2aa9710 adding new = 1
...

So in this case we decided the path is interesting, thanks to the
increased cost of sort.

The comment for add_partial_path says this:

* Neither do we need to consider startup costs: parallelism is only
* used for plans that will be run to completion. Therefore, this
* routine is much simpler than add_path: it needs to consider only
* pathkeys and total cost.

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change
needed for this patch. Do you think it's better addressed in a separate
thread? Or should we retain it as part of this patch for now (and
possibly break it out later)? On the other hand, it's entirely possible
that someone more familiar with parallel plan limitations could explain
why the above comment holds true. That makes me lean towards asking in
a new thread.

I've also attached a new base patch (incremental-sort-30.patch) which
includes some of the other obvious fixes (costing, etc.) that you'd
previously proposed.

James Coleman

Attachments:

incremental-sort-30.patch (application/octet-stream)
commit 4b96729b0da92f8a44382762ca21de74393b52a9
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Fri May 31 14:40:17 2019 +0000

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases
    when the input is already sorted by a prefix of the sort keys. For
    example when a sort by (key1, key2 ... keyN) is requested, and the
    input is already sorted by (key1, key2 ... keyM), M < N, we can
    divide the input into groups where keys (key1, ... keyM) are equal,
    and only sort on the remaining columns.
    
    The implemented algorithm operates in two different modes:
      - Fetching a minimum number of tuples without checking prefix key
        group membership and sorting on all columns when safe.
      - Fetching all tuples for a single prefix key group and sorting on
        solely the unsorted columns.
    We always begin in the first mode, and employ a heuristic to switch
    into the second mode if we believe it's beneficial.
    
    Sorting incrementally can potentially use less memory (and possibly
    avoid spilling to disk), avoid fetching and sorting all tuples in the
    dataset (particularly useful when a LIMIT clause has been specified),
    and begin returning tuples before the entire result set is available.
    Small datasets which fit entirely in memory and must be fully realized
    and sorted may be slightly slower, which we reflect in the costing
    implementation.
    
    The hybrid mode approach allows us to optimize for both very small
    groups (where the overhead of a new tuplesort is high) and very large
    groups (where we can lower cost by not having to sort on already sorted
    columns), albeit at some extra cost while switching between modes.
    
    Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9ba845b53a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4368,6 +4368,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 92969636b7..289babc989 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 1f18e5d3a2..8680e7d911 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -559,8 +564,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53cd2fc666..bf11a08644 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c227282975..a9dd08fa6f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -694,6 +700,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -840,6 +850,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react properly to
+		 * changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort each of them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them all together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail keys
+	 * are more likely to change. Therefore we do our comparison starting from
+	 * the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/* The tuple isn't part of the current batch so we need to carry
+				 * it over into the next set of tuples we transfer out of the full
+				 * sort tuplesort into the presorted prefix tuplesort. We don't
+				 * actually have to do anything special to save the tuple since
+				 * we've already loaded it into the node->transfer_tuple slot, and,
+				 * even though that slot points to memory inside the full sort
+				 * tuplesort, we can't reset that tuplesort anyway until we've
+				 * fully transferred out of its tuples, so this reference is safe.
+				 * We do need to reset the group pivot tuple though since we've
+				 * finished the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/* Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
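+ *
+ *		As an illustration (hypothetical query, not from the regression
+ *		tests): for ORDER BY a, b over input already sorted by a, small
+ *		groups of equal "a" values are batched and sorted on (a, b), while
+ *		a sufficiently large group is sorted on b alone.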
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the full
+			 * sort tuplesort. The first call to switchToPresortedPrefixMode()
+			 * pulled one of those groups out, and we've returned those
+			 * tuples to the caller, but if tuples remain in that
+			 * tuplesort (i.e., n_fullsort_remaining > 0) at this point we
+			 * need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If the sort is bounded, calculate the number of tuples remaining
+		 * and configure both the bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have
+		 * to carry over that extra tuple and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current prefix
+				 * key group, since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we assume it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" sort after the retained
+				 * ones, and we're already contractually guaranteed to not
+				 * need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort
+		 * tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and make the tuples available to the caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+	 * current prefix key group in the tuplesortstate.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Make standalone slots to store the group pivot tuple and the tuple
+	 * transferred between the two tuplesorts when switching modes.
+	 */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * We forget previous sort results and have to re-read the subplan and
+	 * re-sort.  Unlike plain Sort, we can never just rewind and rescan the
+	 * sorted output, because we only keep the current group in the
+	 * tuplesortstate.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 78deade89b..de27b06e15 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -921,6 +921,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -932,13 +950,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4900,6 +4934,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..b8c3826a17 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -830,10 +830,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -843,6 +841,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3755,6 +3771,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..9e0d42322c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2114,12 +2114,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2128,6 +2129,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2761,6 +2788,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7723481b0..3efc807164 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3884,6 +3884,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..022a036be1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,166 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
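+ *
+ * As a rough illustrative sketch (numbers invented): with 10000 input
+ * tuples in an estimated 100 groups, we cost a single tuplesort of about
+ * 1.5 * (10000 / 100) = 150 tuples, charge it once in the startup cost and
+ * (ngroups - 1) more times in the run cost, and then add per-tuple group
+ * detection and per-group tuplesort reset overheads.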
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
 
+	if (!enable_incrementalsort)
+		startup_cost += disable_cost;
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and here we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and
+	 * increase its assumed average group size by half (a 1.5 multiplier).
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost
+	 * of producing all the remaining tuples is given by the cost to finish
+	 * processing this group, plus the total cost to process the remaining
+	 * groups, plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 08b5061612..454c61e1d8 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -332,6 +332,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
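+ *
+ *    For example (illustrative): keys1 = (a, b) and keys2 = (a, b, c)
+ *    yields true with *n_common = 2, while keys1 = (a, c) and
+ *    keys2 = (a, b) yields false with *n_common = 1.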
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1791,19 +1836,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use an incremental sort on top of them.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 12fba56285..bfb52f21ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -241,6 +243,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -255,6 +261,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -457,6 +465,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1988,6 +2001,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5050,17 +5089,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5633,9 +5679,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5649,6 +5698,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -5995,6 +6075,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6729,6 +6845,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 401299e542..16996b1bc2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,8 +4922,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4962,29 +4962,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index dc11f098e0..878cb6b934 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -648,6 +648,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index efd0fbc21c..41a5e18195 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2bb00..91066b238c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2777,6 +2777,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
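+ *
+ * For example, create_ordered_paths() calls this when an input path is
+ * already sorted by a nonempty prefix of the requested pathkeys.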
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 631f16f5fe..e90692287b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -941,6 +941,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b8e67899e..16098ed8eb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We choose this size so that the array
+ * doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation overhead
+ * stays as low as possible.  However, we don't consider array sizes less
+ * than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -243,6 +252,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied across
+								   the sorts of all groups, either in-memory
+								   or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's a value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuplesort metadata that
+								   persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1222,17 +1249,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1293,7 +1322,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit that data into main
+	 * memory.  This is why we consider space used on disk more important for
+	 * tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less than
+	 * the amount of space occupied by the same tuples in memory, due to the
+	 * more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and so saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2590,8 +2723,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2641,7 +2773,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
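
If I read tuplesort_updatemax correctly, the user-visible effect of the
changes above is that EXPLAIN ANALYZE now reports the maximum space consumed
across all of the per-group sort batches, rather than only the space of the
last batch.  A sketch of how one might observe this (again assuming tenk1):

    explain (analyze, costs off, timing off, summary off)
    select * from (select * from tenk1 order by four) t
    order by four, ten limit 1;
    -- the reported sort space should reflect the largest batch, whether
    -- it stayed in memory or spilled to disk
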
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index c119fdf4fa..3e48593543 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 99b9fa414f..42d5a46974 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1924,6 +1924,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfo	fcinfo; /* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -1952,6 +1966,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* is fetching tuples from the outer
+								   node finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups processed by the full sort */
+	int64		prefixsort_group_count;	/* number of groups processed by the prefix-key sort */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 4e2fb39105..0500a3199f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 441e64eca9..9d45feb37b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1614,6 +1614,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 70f8b8e22b..f9baee6495 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -762,6 +762,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b9d7a77e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffeef4b..61c3940921 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..e7a40cec3f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -183,6 +183,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 4521de18e1..65a73af214 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -216,6 +216,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -240,6 +241,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
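
In case it helps other reviewers: the limit values used above (31/32/33 and
65/66) appear to be chosen to straddle the points where the node switches
between its sorting modes; I'm assuming a minimum group size of 32 tuples
here, so please correct me if the actual constant in the patch differs.
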
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f045e..f5f4f7c9f9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca1012a..7afd0cc373 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
0001-parallel-and-more-paths.patch (application/octet-stream)
commit d07b4086918a0d255e02985a1172bcca471efa8b
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Sat Jul 20 14:09:20 2019 +0000

    WIP: Parallel + more create_incremental_sort_paths()

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3efc807164..c4c6714218 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2730,6 +2730,220 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering the query_pathkeys is always worth it, because they
+	 * might let us avoid an explicit sort higher up in the plan.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * Sorting by a volatile expression is not safe below a Gather
+			 * Merge node, and we can only sort by expressions that are
+			 * computable from this relation.  So, unless every query
+			 * pathkey has a suitable equivalence member, forget it.
+			 *
+			 * Examining the member expressions would detect volatile
+			 * expressions as well, but checking ec_has_volatile here saves
+			 * some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do an incremental sort on top of an
+		 * index scan, all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just at the pathkeys of
+ * the input paths (aiming to preserve the ordering).  It also considers
+ * orderings that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide them.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2902,7 +3116,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
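
As far as I understand it, the plan shape generate_useful_gather_paths is
after looks roughly like the following (a sketch only, reusing the
hypothetical pk_demo table and index from the earlier example):

    set max_parallel_workers_per_gather = 2;
    explain (costs off) select * from pk_demo order by a, b;
    -- hoped-for shape:
    --   Gather Merge
    --     Workers Planned: 2
    --     ->  Incremental Sort
    --           Sort Key: a, b
    --           Presorted Key: a
    --           ->  Parallel Index Scan using pk_demo_a_idx on pk_demo
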
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bfb52f21ab..c2877942cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -5932,7 +5932,10 @@ prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 				}
 			}
 			if (!j)
-				elog(ERROR, "could not find pathkey item to sort");
+			{
+				elog(WARNING, "could not find pathkey item to sort");
+				Assert(false);
+			}
 
 			/*
 			 * Do we need to insert a Result node?
@@ -6491,7 +6494,10 @@ make_unique_from_pathkeys(Plan *lefttree, List *pathkeys, int numCols)
 		}
 
 		if (!tle)
-			elog(ERROR, "could not find pathkey item to sort");
+		{
+			elog(WARNING, "could not find pathkey item to sort");
+			Assert(false);
+		}
 
 		/*
 		 * Look up the correct equality operator from the PathKey's slightly
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 16996b1bc2..54b244b158 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,8 +4922,9 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new paths we need consider is an explicit full or
- * incremental sort on the cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort on the
+ * cheapest-total existing path and an incremental sort on partially
+ * presorted paths.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -5001,7 +5002,12 @@ create_ordered_paths(PlannerInfo *root,
 			}
 			if (presorted_keys > 0)
 			{
-				/* Also consider incremental sort. */
+				/*
+				 * Also consider incremental sort.  Unlike a standard sort,
+				 * we don't care whether the input path is the cheapest one;
+				 * we're concerned only with whether the input path is
+				 * already usefully, if only partially, sorted.
+				 */
 				sorted_path = (Path *) create_incremental_sort_path(root,
 																	ordered_rel,
 																	input_path,
@@ -5068,6 +5074,62 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/* also consider incremental sorts on all partial paths */
+		{
+			ListCell *lc;
+			foreach (lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/* already handled above */
+				/* if (input_path == cheapest_partial_path) */
+				/* 	continue; */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys, &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys > 0)
+				{
+					/* Also consider incremental sort. */
+					sorted_path = (Path *) create_incremental_sort_path(root,
+																		ordered_rel,
+																		input_path,
+																		root->sort_pathkeys,
+																		presorted_keys,
+																		limit_tuples);
+					total_groups = input_path->rows *
+						input_path->parallel_workers;
+					sorted_path = (Path *)
+						create_gather_merge_path(root, ordered_rel,
+												 sorted_path,
+												 sorted_path->pathtarget,
+												 root->sort_pathkeys, NULL,
+												 &total_groups);
+
+					/* Add projection step if needed */
+					if (sorted_path->pathtarget != target)
+						sorted_path = apply_projection_to_path(root, ordered_rel,
+															   sorted_path, target);
+
+					/*
+					 * XXX: what case does this cover?
+					 * (or is it entirely duplicative of generate_useful_gather_paths()
+					 * in apply_scanjoin_target_to_paths())
+					 */
+					add_path(ordered_rel, sorted_path);
+				}
+			}
+
+		}
 	}
 
 	/*
@@ -6484,6 +6546,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			}
 		}
 
+
+		/*
+		 * Use any available suitably-sorted path as input, with incremental
+		 * sort path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
+		}
+
 		/*
 		 * Instead of operating directly on the input relation, we can
 		 * consider finalizing a partially aggregated path.
@@ -6530,6 +6666,53 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   havingQual,
 											   dNumGroups));
 			}
+
+			/* incremental sort */
+			foreach(lc, partially_grouped_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
+			}
+
 		}
 	}
 
@@ -6798,6 +6981,57 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Use any available suitably-sorted path as input, and also consider
+		 * sorting the cheapest partial path.
+		 */
+		foreach(lc, input_rel->pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* also ignore already sorted paths */
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			/* add incremental sort */
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_path(partially_grouped_rel, (Path *)
+						 create_agg_path(root,
+										 partially_grouped_rel,
+										 path,
+										 partially_grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_INITIAL_SERIAL,
+										 parse->groupClause,
+										 NIL,
+										 agg_partial_costs,
+										 dNumPartialGroups));
+			else
+				add_path(partially_grouped_rel, (Path *)
+						 create_group_path(root,
+										   partially_grouped_rel,
+										   path,
+										   parse->groupClause,
+										   NIL,
+										   dNumPartialGroups));
+		}
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6842,6 +7076,52 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   dNumPartialPartialGroups));
 			}
 		}
+
+		/* consider incremental sort */
+		foreach(lc, input_rel->partial_pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+				continue;
+
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
+		}
 	}
 
 	if (can_hash && cheapest_total_path != NULL)
@@ -6938,6 +7218,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
@@ -6967,6 +7248,44 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/* also consider incremental sort on all partial paths */
+	foreach (lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
+
 }
 
 /*
@@ -7222,7 +7541,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index e7a40cec3f..20fa94281b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9775cc898c..b687c062ec 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -940,6 +940,64 @@ explain (costs off)
          Index Cond: (unique1 = 1)
 (5 rows)
 
+ROLLBACK TO SAVEPOINT settings;
+SAVEPOINT settings;
+set local max_parallel_workers_per_gather=4;
+set local min_parallel_table_scan_size=0;
+set local parallel_tuple_cost=0;
+set local parallel_setup_cost=0;
+-- incremental sort tests
+-- without generate_useful_gather_paths() in apply_scanjoin_target_to_paths()
+-- we don't get the following plan (though regardless of that choice, with
+-- enable_sort=off we get a similar plan, but using:
+-- Finalize GroupAggregate,
+--   -> Gather Merge
+--        -> Partial GroupAggregate
+-- instead of:
+-- GroupAggregate,
+--   -> Gather Merge
+explain (costs off) select hundred, thousand, sum(twenty) from tenk1 group by 1,2 order by 1,2,3 limit 1;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: hundred, thousand, (sum(twenty))
+         Presorted Key: hundred, thousand
+         ->  GroupAggregate
+               Group Key: hundred, thousand
+               ->  Gather Merge
+                     Workers Planned: 4
+                     ->  Incremental Sort
+                           Sort Key: hundred, thousand
+                           Presorted Key: hundred
+                           ->  Parallel Index Scan using tenk1_hundred on tenk1
+(12 rows)
+
+-- without generate_useful_gather_paths() in standard_join_search()
+-- we don't get the following plan
+explain (costs off) select * from tenk1 t1 join tenk1 t2 on t1.hundred = t2.hundred join tenk1 t3 on t1.hundred = t3.hundred order by t1.hundred, t1.twenty limit 50;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Limit
+   ->  Merge Join
+         Merge Cond: (t1.hundred = t3.hundred)
+         ->  Gather Merge
+               Workers Planned: 4
+               ->  Incremental Sort
+                     Sort Key: t1.hundred, t1.twenty
+                     Presorted Key: t1.hundred
+                     ->  Merge Join
+                           Merge Cond: (t1.hundred = t2.hundred)
+                           ->  Sort
+                                 Sort Key: t1.hundred
+                                 ->  Parallel Seq Scan on tenk1 t1
+                           ->  Sort
+                                 Sort Key: t2.hundred
+                                 ->  Seq Scan on tenk1 t2
+         ->  Materialize
+               ->  Index Scan using tenk1_hundred on tenk1 t3
+(18 rows)
+
 ROLLBACK TO SAVEPOINT settings;
 -- exercise record typmod remapping between backends
 CREATE FUNCTION make_record(n int)
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index f96812b550..9b4d5a5cd8 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -339,6 +339,30 @@ explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
 ROLLBACK TO SAVEPOINT settings;
 
+
+SAVEPOINT settings;
+set local max_parallel_workers_per_gather=4;
+set local min_parallel_table_scan_size=0;
+set local parallel_tuple_cost=0;
+set local parallel_setup_cost=0;
+
+-- incremental sort tests
+
+-- without generate_useful_gather_paths() in apply_scanjoin_target_to_paths()
+-- we don't get the following plan (though regardless of that choice, with
+-- enable_sort=off we get a similar plan, but using:
+-- Finalize GroupAggregate,
+--   -> Gather Merge
+--        -> Partial GroupAggregate
+-- instead of:
+-- GroupAggregate,
+--   -> Gather Merge
+explain (costs off) select hundred, thousand, sum(twenty) from tenk1 group by 1,2 order by 1,2,3 limit 1;
+-- without generate_useful_gather_paths() in standard_join_search()
+-- we don't get the following plan
+explain (costs off) select * from tenk1 t1 join tenk1 t2 on t1.hundred = t2.hundred join tenk1 t3 on t1.hundred = t3.hundred order by t1.hundred, t1.twenty limit 50;
+ROLLBACK TO SAVEPOINT settings;
+
 -- exercise record typmod remapping between backends
 CREATE FUNCTION make_record(n int)
   RETURNS RECORD LANGUAGE plpgsql PARALLEL SAFE AS
#156Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#155)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 10:33:02AM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Now, consider this example:

create table t (a int, b int, c int);
insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
create index on t (a);
analyze t;
explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With the 0001+0002+0003 patches, I get a plan like this:

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)

and with 0004, I get this:

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that cost of the second plan is almost double the first one. That
means 0004 does not even generate the first plan, i.e. there are cases
where we don't try to add the explicit sort before passing the path to
generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_increment_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

As I continue this, I've added a couple of test cases (notably for
generate_useful_gather_paths() in both standard_join_search() and
apply_scanjoin_target_to_paths()). Those, plus the current WIP state
of my hacking on your patch adding generate_useful_gather_paths() is
attached as 0001-parallel-and-more-paths.patch.

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

 Limit
   -> Gather Merge
     -> Incremental Sort
       -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

Good question. Not sure. I think such path would need to do something on
a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

SELECT
a, b, sum(expensive_function(c))
FROM
t
GROUP BY a,b
ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
     -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
           -> Group Aggregate
                a, b, sum(expensive_function(c))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...
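
(A minimal setup for experimenting with this - the data distribution,
the (a,b) index and the function cost below are just assumptions - might
look like:

create table t (a int, b int, c int);
create index on t (a, b);
create function expensive_function(i int) returns int
language sql immutable parallel safe cost 10000 as 'select i';
insert into t select mod(i,100), mod(i,10), i
from generate_series(1,1000000) s(i);
analyze t;

Note the function has to be marked parallel safe, otherwise no parallel
plan can be considered at all.)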

...

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change
needed for this. Do you think it's better addressed in a separate
thread? Or retain it as part of this patch for now (and possibly break
it out later)? On the other hand, it's entirely possible that someone
more familiar with parallel plan limitations could explain why the
above comment holds true. That makes me lean towards asking in a new
thread.

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

I've also attached a new base patch (incremental-sort-30.patch) which
includes some of the other obvious fixes (costing, etc.) that you'd
previously proposed.

Thanks!

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#157James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#156)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

 Limit
   -> Gather Merge
     -> Incremental Sort
       -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

Good question. Not sure. I think such path would need to do something on
a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

SELECT
a, b, sum(expensive_function(c))
FROM
t
GROUP BY a,b
ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
     -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
           -> Group Aggregate
                a, b, sum(expensive_function(c))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Sort
                                 Sort Key: a, b
                                 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

James Coleman

#158James Coleman
jtc331@gmail.com
In reply to: James Coleman (#157)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

 Limit
   -> Gather Merge
     -> Incremental Sort
       -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

Good question. Not sure. I think such path would need to do something on
a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

SELECT
a, b, sum(expensive_function(c))
FROM
t
GROUP BY a,b
ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
     -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
           -> Group Aggregate
                a, b, sum(expensive_function(c))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Sort
                                 Sort Key: a, b
                                 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

Also just realized I don't think (?) we can order by the sum inside a
gather-merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?

James Coleman

#159Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#158)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

 Limit
   -> Gather Merge
     -> Incremental Sort
       -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

Good question. Not sure. I think such path would need to do something on
a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

SELECT
a, b, sum(expensive_function(c))
FROM
t
GROUP BY a,b
ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
     -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
           -> Group Aggregate
                a, b, sum(expensive_function(c))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Sort
                                 Sort Key: a, b
                                 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

I wasn't strictly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.

Also just realized I don't think (?) we can order by the sum inside a
gather-merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why that wouldn't be the case, and if we
only need a couple of rows from the beginning ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#160James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#159)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 1:00 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

 Limit
   -> Gather Merge
     -> Incremental Sort
       -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

Good question. Not sure. I think such path would need to do something on
a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

SELECT
a, b, sum(expensive_function(c))
FROM
t
GROUP BY a,b
ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
     -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
           -> Group Aggregate
                a, b, sum(expensive_function(c))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Sort
                                 Sort Key: a, b
                                 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

I wasn't strictly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.

Also just realized I don't think (?) we can order by the sum inside a
gather-merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why that wouldn't be the case, and if we
only need a couple of rows from the beginning ...

Ah, I see. Right now we get:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 2
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Parallel Index Scan using t_a_b on t

even with the parallel additions to create_ordered_paths() -- that
addition doesn't actually add any new parallel paths because
input_rel->partial_pathlist != NIL is false (I'm not sure why yet), so
if we want (if I understand correctly) something more like:

I'm still struggling to understand though how another incremental sort
below the gather-merge would actually be able to help us. For one I'm
not sure it would be less expensive, but more importantly I'm not sure
how we could do that and maintain correctness. Wouldn't a per-worker
sum actually be useless for sorting, since it has no predictable
impact on the ordering of the total sum?

Describing that got me thinking of similar cases where ordering of the
partial aggregate would (in theory) be a correct partial sort for the
total ordering, and it seems like min() and max() would be. So I ran
the same experiment with that instead of sum(), but, you guessed it,
input_rel->partial_pathlist != NIL is false again, so we don't add any
parallel paths in create_ordered_paths().

I'm leaning towards thinking that considering parallel incremental sorts in
create_ordered_paths() won't add value. But I also feel like this
whole project has me jumping into the deep end of the pool again (this
time the planner), so I'm still picking up a lot of pieces for how all
this fits together, and as such I don't have a great intuitive grasp
yet of how this particular part of the planning process maps to the
kind of queries and plans we consider. All that to say: if you have
further thoughts, I'm happy to look into it, but right now I'm not
seeing anything.

James Coleman

#161Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#160)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 07:37:08PM -0400, James Coleman wrote:

..

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why that wouldn't be the case, and if we
only need a couple of rows from the beginning ...

Ah, I see. Right now we get:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 2
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Parallel Index Scan using t_a_b on t

even with the parallel additions to create_ordered_paths() -- that
addition doesn't actually add any new parallel paths because
input_rel->partial_pathlist != NIL is false (I'm not sure why yet), so
if we want (if I understand correctly) something more like:

Well, in this particular case it's fairly simple, I think. We only call
create_ordered_paths() from the grouping planner, on the upper
relation that represents the result of the GROUP BY. But that means it
has to see only the finalized result (after Finalize GroupAggregate). So
it can't see partial aggregation or any other partial path. So in this
case it seems guaranteed (partial_pathlist == NIL).

So maybe we should not be looking at GROUP BY queries, which probably
can't hit this particular code path at all - we need a different type of
upper relation. For example UNION ALL should hit this code, I think.

So maybe try

select * from t union all select * from t order by a, b limit 10;

and that will hit this condition with partial_pathlist != NIL.
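
(Checking that should just need the usual settings for forcing parallel
plans mentioned earlier, something like

set max_parallel_workers_per_gather = 4;
set min_parallel_table_scan_size = 0;
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
explain (costs off)
select * from t union all select * from t order by a, b limit 10;

and then stepping through create_ordered_paths() to see whether
partial_pathlist is non-empty there.)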

I don't know if such queries can benefit from incremental sort, though.
There are other upper relations too:

typedef enum UpperRelationKind
{
	UPPERREL_SETOP,				/* result of UNION/INTERSECT/EXCEPT, if any */
	UPPERREL_PARTIAL_GROUP_AGG, /* result of partial grouping/aggregation,
								 * if any */
	UPPERREL_GROUP_AGG,			/* result of grouping/aggregation, if any */
	UPPERREL_WINDOW,			/* result of window functions, if any */
	UPPERREL_DISTINCT,			/* result of "SELECT DISTINCT", if any */
	UPPERREL_ORDERED,			/* result of ORDER BY, if any */
	UPPERREL_FINAL				/* result of any remaining top-level actions */
	/* NB: UPPERREL_FINAL must be last enum entry; it's used to size arrays */
} UpperRelationKind;

I'm still struggling to understand though how another incremental sort
below the gather-merge would actually be able to help us. For one I'm
not sure it would be less expensive,

That's a good question. I think in general we agree that if we can get
the gather merge to sort the data the way the operation above the gather
merge needs it (that being the first operation that can't run in
parallel mode), that's probably beneficial.

So this pattern seems reasonable:

  -> something
     -> non-parallel operation
        -> gather merge
           -> incremental sort
              -> something

And it's likely faster, especially when the parts above this can
leverage the lower startup cost. Say, when there's an explicit LIMIT.

I think the question is where exactly we add the incremental sort.
It's quite possible some of the places we modified are redundant, at
least for some queries. Not sure.

but more importantly I'm not sure how we could do that and maintain
correctness. Wouldn't a per-worker sum actually be useless for
sorting, since it has no predictable impact on the ordering of the total
sum?

Yes, you're right - that wouldn't be correct. The reason I've been
thinking about such examples is that distributed databases do such
things in some cases, and we might do that too with partitioning.

Consider a simple example

create table t (a int, b int, c int) partition by hash (a);
create table t0 partition of t for values with (modulus 4, remainder 0);
create table t1 partition of t for values with (modulus 4, remainder 1);
create table t2 partition of t for values with (modulus 4, remainder 2);
create table t3 partition of t for values with (modulus 4, remainder 3);
insert into t select mod(i,1000), i, i from generate_series(1,1000000) s(i);
analyze t;
select a, count(b) from t group by a order by a, count(b) limit 10;

In this case we might do a plan similar to what I proposed, assuming
each worker gets to execute on a different partition (because then we
know each worker will see distinct groups, thanks to the partitioning).
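
Hypothetically - we can't generate such plans today - it might look
something like this, assuming an index on (a) and each worker pinned to
a single partition:

  -> Limit
     -> Gather Merge (pathkeys: a, count(b))
        -> Incremental Sort (pathkeys: a, count(b))
           -> GroupAggregate (pathkeys: a)
              -> Parallel Index Scan (pathkeys: a)

Each group would then live entirely within a single worker, so the
per-worker counts would be the final counts and the ordering would be
correct.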

But AFAIK we don't do such optimizations yet, so it's probably just a
distraction.

Describing that got me thinking of similar cases where ordering of the
partial aggregate would (in theory) be a correct partial sort for the
total ordering, and it seems like min() and max() would be. So I ran
the same experiment with that instead of sum(), but, you guessed it,
input_rel->partial_pathlist != NIL is false again, so we don't add any
parallel paths in create_ordered_paths().

Right. That's because with aggregation, the grouping planner only sees
the total result, not the partial paths. We need a different upper rel
to exercise that code path.

I'm leaning towards thinking that considering parallel incremental sorts in
create_ordered_paths() won't add value. But I also feel like this
whole project has me jumping into the deep end of the pool again (this
time the planner), so I'm still picking up a lot of pieces for how all
this fits together, and as such I don't have a great intuitive grasp
yet of how this particular part of the planning process maps to the
kind of queries and plans we consider. All that to say: if you have
further thoughts, I'm happy to look into it, but right now I'm not
seeing anything.

Understood. FWIW I'm not particularly familiar with this code (or which
places are supposed to work together), so I definitely agree it may be
overwhelming. Especially when it's only a part of a larger patch.

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#162Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#161)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

...

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show some of
the places don't change plan for any of the queries, although there may
be some additional query where it'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

SET devel_create_ordered_paths = on;

it'll start creating the paths in non-parallel in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.
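
For example, to check whether one of the places affects a particular
query (the query here is just one of the synthetic ones, purely for
illustration):

set devel_create_ordered_paths = on;
explain (costs off)
select a, b, sum(c) from t group by a, b order by a, b, sum(c) limit 10;

set devel_create_ordered_paths_parallel = on;
explain (costs off)
select a, b, sum(c) from t group by a, b order by a, b, sum(c) limit 10;

and compare the plans to what master produces with all the flags off.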

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#163Rafia Sabih
rafia.pghackers@gmail.com
In reply to: Tomas Vondra (#162)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

...

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show some of
the places don't change plan for any of the queries, although there may
be some additional query where it'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

SET devel_create_ordered_paths = on;

it'll start creating the paths in non-parallel in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

Wow, that sounds like an elaborate experiment. But where is this
spreadsheet you mentioned?

--
Regards,
Rafia Sabih

#164Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Rafia Sabih (#163)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Sep 04, 2019 at 11:37:48AM +0200, Rafia Sabih wrote:

On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

...

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show some of
the places don't change plan for any of the queries, although there may
be some additional query where it'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

SET devel_create_ordered_paths = on;

it'll start creating the paths in non-parallel in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

Wow, that sounds like an elaborate experiment. But where is this
spreadsheet you mentioned?

It seems I forgot to push the commit containing the spreadsheet with
results. I'll fix that tomorrow.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#165Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#164)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Sep 04, 2019 at 09:17:10PM +0200, Tomas Vondra wrote:

On Wed, Sep 04, 2019 at 11:37:48AM +0200, Rafia Sabih wrote:

On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

...

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show some of
the places don't change plan for any of the queries, although there may
be some additional query where it'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

SET devel_create_ordered_paths = on;

it'll start creating the paths in non-parallel in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

Wow, that sounds like an elaborate experiment. But where is this
spreadsheet you mentioned?

It seems I forgot to push the commit containing the spreadsheet with
results. I'll fix that tomorrow.

OK, I've pushed the commit with the spreadsheet. The single sheet lists
the synthetic queries, and hashes of plans with different flags enabled
(parallel query, force incremental sort, and the new developer GUCs
mentioned before). At most one developer flag is set to true at a time.

The columns at the end simply say whether the plan differs from the plan
generated by master (no patches). TRUE means "same as master" while
FALSE means "different plan.

The "patched" column means all developer GUCs disabled, so it's expected
to produce the same plan as master (and it is). And then there's one
column for each developer GUC. If the column is just TRUE it means the
GUC does not affect any of the synthetic queries. There are 4 of them:

- devel_add_paths_to_grouping_rel_parallel
- devel_create_partial_grouping_paths
- devel_gather_grouping_paths
- devel_standard_join_search

The places controlled by those GUCs are either useless, or the query
affected by them is not included in the list of queries.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#166Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#162)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2019-Jul-30, Tomas Vondra wrote:

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

[...]

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

[...]

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#167James Coleman
jtc331@gmail.com
In reply to: Alvaro Herrera (#166)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended
up needing to use my extra time for working on a conference talk. That
talk is today, so I'm hoping to be able to catch up on this again
soon.

James Coleman

#168Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#166)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:

On 2019-Jul-30, Tomas Vondra wrote:

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.

[...]

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

[...]

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

Yes. I think the spreadsheet can help us with answering two things:

1) places actually affecting the plan (all but three do)

2) redundant places (there are some cases where two GUCs produce the
same plan in the end)

Of course, this does assume the query set makes sense and is somewhat
realistic, but I've tried to construct queries where that is true. We
may extend it over time, of course.

I think we've agreed to add incremental sort paths in different places in
separate patches, to make review easier. So this may be a useful way to
decide which places to address first. I'd probably do it in this order:

- create_ordered_paths
- create_ordered_paths (parallel part)
- add_paths_to_grouping_rel
- ... not sure ...

but that's just a proposal. It'd give us most of the benefits, I think,
and we could also focus on the rest of the patch.
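
For reference, the create_ordered_paths case is the textbook one: an
ORDER BY whose prefix is already provided by an index. Using the
regression-test table tenk1 (which has an index on hundred), something
like

EXPLAIN (COSTS OFF)
SELECT * FROM tenk1 ORDER BY hundred, twenty LIMIT 50;

should be able to use an incremental sort with hundred as the presorted
key, so only twenty needs sorting within each group.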

Also, regarding the three GUCs that don't affect any of the queries, we
can't really add them as we wouldn't be able to test them. If we manage
to construct a query that'd benefit from them, we can revisit this.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#169Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#167)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Sep 12, 2019 at 11:54:06AM -0400, James Coleman wrote:

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended
up needing to use my extra time for working on a conference talk. That
talk is today, so I'm hoping to be able to catch up on this again
soon.

Good! I'm certainly looking forward to a new patch version.

As discussed in the past, this patch is pretty sensitive (large, touches
planning, ...), so we should try getting most of it in not too late in
the cycle. For example 2019-11 would be nice.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#170James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#169)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Sep 13, 2019 at 10:54 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Sep 12, 2019 at 11:54:06AM -0400, James Coleman wrote:

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended
up needing to use my extra time for working on a conference talk. That
talk is today, so I'm hoping to be able to catch up on this again
soon.

Good! I'm certainly looking forward to a new patch version.

As discussed in the past, this patch is pretty sensitive (large, touches
planning, ...), so we should try getting most of it in not too late in
the cycle. For example 2019-11 would be nice.

Completely agree; originally I'd hoped to have it in rough draft
finished form to get serious review in the September CF...but...

#171James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#156)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change
needed for this. Do you think it's better addressed in a separate
thread? Or retain it as part of this patch for now (and possibly break
it out later)? On the other hand, it's entirely possible that someone
more familiar with parallel plan limitations could explain why the
above comment holds true. That makes me lean towards asking in a new
thread.

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.

One thing to note though: the current patch does not also modify
add_partial_path_precheck which also does not take into account
startup cost, so we probably need to update that for completeness's
sake.
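
To spell out the shape of the problem for the archives (a sketch, not a
minimal reproduction): any plan where a LIMIT sits on top of a Gather
Merge can be affected, e.g.

SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN (COSTS OFF)
SELECT * FROM tenk1 ORDER BY hundred, twenty LIMIT 10;

because a low-startup-cost partial path (such as an incremental sort
over an index scan) can lose to a lower-total-cost partial path while
add_partial_path() and add_partial_path_precheck() compare total cost
only.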

James Coleman

#172Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#171)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change
needed for this. Do you think it's better addressed in a separate
thread? Or retain it as part of this patch for now (and possibly break
it out later)? On the other hand, it's entirely possible that someone
more familiar with parallel plan limitations could explain why the
above comment holds true. That makes me lean towards asking in a new
thread.

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.

Hmm, I see.

While I initially suggested to start a separate thread only if we had an
example not involving an incremental sort, that's probably not a hard
requirement. I think it's fine to start a thread briefly explaining the
issue, and pointing to the incremental sort thread for an actual example.

One thing to note though: the current patch does not also modify
add_partial_path_precheck which also does not take into account
startup cost, so we probably need to update that for completeness's
sake.

Good point. It does indeed seem to make the same assumption about only
comparing total cost before calling add_path_precheck.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#173James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#165)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The "patched" column means all developer GUCs disabled, so it's expected
to produce the same plan as master (and it is). And then there's one
column for each developer GUC. If the column is just TRUE it means the
GUC does not affect any of the synthetic queries. There are 4 of them:

- devel_add_paths_to_grouping_rel_parallel
- devel_create_partial_grouping_paths
- devel_gather_grouping_paths
- devel_standard_join_search

The places controlled by those GUCs are either useless, or the query
affected by them is not included in the list of queries.

I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these GUCs being set in the
test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;
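
For the archives, the winning plan looks roughly like this (a sketch
reconstructed from the description above, not verbatim EXPLAIN output;
the elided subplans depend on the join order chosen):

Limit
  ->  Merge Join
        Merge Cond: (t1.hundred = t3.hundred)
        ->  Gather Merge
              ->  Incremental Sort
                    Sort Key: t1.hundred, t1.twenty
                    Presorted Key: t1.hundred
                    ->  ...
        ->  ...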

So I think we can reduce the number of unused GUCs to 3.

James

#174James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#172)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Sep 16, 2019 at 6:32 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change
needed for this. Do you think it's better addressed in a separate
thread? Or retain it as part of this patch for now (and possibly break
it out later)? On the other hand, it's entirely possible that someone
more familiar with parallel plan limitations could explain why the
above comment holds true. That makes me lean towards asking in a new
thread.

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.

Hmm, I see.

While I initially suggested to start a separate thread only if we had an
example not involving an incremental sort, that's probably not a hard
requirement. I think it's fine to start a thread briefly explaining the
issue, and pointing to the incremental sort thread for an actual example.

One thing to note though: the current patch does not also modify
add_partial_path_precheck which also does not take into account
startup cost, so we probably need to update that for completeness's
sake.

Good point. It does indeed seem to make the same assumption about only
comparing total cost before calling add_path_precheck.

I've started a new thread to discuss:
/messages/by-id/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg@mail.gmail.com

James

#175James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#168)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Sep 13, 2019 at 10:51 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:

On 2019-Jul-30, Tomas Vondra wrote:

I've decided to do a couple of experiments, trying to make up my mind about
which modified places matter to different queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

[...]

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

Yes. I think the spreadsheet can help us with answering two things:

1) places actually affecting the plan (all but three do)

2) redundant places (there are some cases where two GUCs produce the
same plan in the end)

To expand on this further, (1) should probably help us write test
cases.

Additionally, one big thing we still need that's somewhat external to
the patch is a good way to benchmark: a set of queries that we believe
are representative enough to serve as good benchmarks.

I'd really appreciate some input from you all on that particular
question; I feel like it's in some sense the biggest barrier to
getting the patch merged, but also the part where long experience in
the community/exposure to other use cases will probably be quite
valuable.

Of course, this does assume the query set makes sense and is somewhat
realistic, but I've tried to construct queries where that is true. We
may extend it over time, of course.

I think we've agreed to add incremental sort paths in different places in
separate patches, to make review easier. So this may be a useful way to
decide which places to address first. I'd probably do it in this order:

- create_ordered_paths
- create_ordered_paths (parallel part)
- add_paths_to_grouping_rel
- ... not sure ...

but that's just a proposal. It'd give us most of the benefits, I think,
and we could also focus on the rest of the patch.

Certainly the first two seem like the most obviously necessary base
cases. I think supporting GROUP BY also seems like a pretty standard
case, so at first glance I'd say this seems like a reasonable course
to me.

I'm going to start breaking up the patches in this thread into a
series in support of that. Since I've started a new thread with the
add_partial_path change, I'll include that patch here as part of this
series also. Do you think it's worth moving the tuplesort changes into
a standalone patch in the series also?

Attached is a rebased v31 now broken into the following:

- 001-consider-startup-cost-in-add-partial-path_v1.patch: From the
other thread (Tomas's patch unmodified)
- 002-incremental-sort_v31.patch: Updated base incremental sort patch

Besides rebasing, I've changed the enable_incrementalsort GUC so that
disabling it prevents generating incremental sort paths entirely rather
than penalizing them via costing, since incremental sort is never
absolutely necessary in the way a regular sort is.
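
In other words (a sketch, using the regression table tenk1):

-- v31: incremental sort paths are simply not generated when the GUC is
-- off, rather than being penalized with disable_cost as enable_sort does
SET enable_incrementalsort = off;
EXPLAIN (COSTS OFF)
SELECT * FROM tenk1 ORDER BY hundred, twenty LIMIT 50;
-- the planner now falls back to a plain Sort (or another ordering)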

I'm hoping to add 003 soon with the initial parallel parts, but I'm
about out of time right now and wanted to get something out, so
sending this without that.

Side question: for the patch tester do I have to attach each part of
the series each time even if nothing's changed in several of them? And
does the vN number at the end need to stay the same for all of them?
My attachments to this email don't follow that... Also, since this
email changes patch naming, do I need to do anything to clear out the
old ones? (I suppose if not, then that would imply an answer to the
first question also.)

James

Attachments:

001-consider-startup-cost-in-add-partial-path_v1.patchapplication/octet-stream; name=001-consider-startup-cost-in-add-partial-path_v1.patchDownload
commit 02b738b5e32326a24687890e066c66f5508b5976
Author: Tomas Vondra <tomas@2ndquadrant.com>
Date:   Sun Jul 28 15:55:54 2019 +0200

    Consider low startup cost when adding partial path
    
    45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
    comment:
    
    > Neither do we need to consider startup costs:
    > parallelism is only used for plans that will be run to completion.
    > Therefore, this routine is much simpler than add_path: it needs to
    > consider only pathkeys and total cost.
    
    I'm not entirely sure if that is still true or not--I can't easily come
    up with a scenario in which it's not, but I also can't come up with an
    inherent reason why such a scenario cannot exist.
    
    Regardless, the in-progress incremental sort patch uncovered a new case
    where it definitely no longer holds, and, as a result, a higher cost plan
    ends up being chosen, because a low startup cost partial path is ignored
    in favor of a lower total cost partial path, and a limit applied on
    top of that would normally favor the lower startup cost plan.

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb732ee..5d66fc2177 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -778,41 +778,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
002-incremental-sort_v31.patchapplication/octet-stream; name=002-incremental-sort_v31.patchDownload
commit 3fe3811b79317f418c255d25eb5a760799d26287
Author: jcoleman <jtc331@gmail.com>
Date:   Fri Sep 27 19:36:53 2019 +0000

    Implement incremental sort
    
    Incremental sort is an optimized variant of multikey sort for cases
    when the input is already sorted by a prefix of the sort keys. For
    example when a sort by (key1, key2 ... keyN) is requested, and the
    input is already sorted by (key1, key2 ... keyM), M < N, we can
    divide the input into groups where keys (key1, ... keyM) are equal,
    and only sort on the remaining columns.
    
    The implemented algorithm operates in two different modes:
      - Fetching a minimum number of tuples without checking prefix key
        group membership and sorting on all columns when safe.
      - Fetching all tuples for a single prefix key group and sorting on
        solely the unsorted columns.
    We always begin in the first mode, and employ a heuristic to switch
    into the second mode if we believe it's beneficial.
    
    Sorting incrementally can potentially use less memory (and possibly
    avoid spilling to disk), avoid fetching and sorting all tuples in the
    dataset (particularly useful when a LIMIT clause has been specified),
    and begin returning tuples before the entire result set is available.
    Small datasets which fit entirely in memory and must be fully realized
    and sorted may be slightly slower, which we reflect in the costing
    implementation.
    
    The hybrid mode approach allows us to optimize for both very small
    groups (where the overhead of a new tuplesort is high) and very large
    groups (where we can lower cost by not having to sort on already sorted
    columns), albeit at some extra cost while switching between modes.
    
    Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4784b4b18e..61a20c59e3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4383,6 +4383,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..8a3bf8a4e5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..572aca05fb 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,8 +24,8 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
        nodeLimit.o nodeLockRows.o nodeGatherMerge.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeProjectSet.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o \
+       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o \
+       nodeSort.o nodeIncrementalSort.o nodeUnique.o nodeValuesscan.o \
        nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 1f18e5d3a2..8680e7d911 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -31,6 +31,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -559,8 +564,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 53cd2fc666..bf11a08644 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c227282975..a9dd08fa6f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -314,6 +315,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -694,6 +700,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -840,6 +850,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react properly to
+		 * changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail keys
+	 * are more likely to change.  Therefore we do our comparison starting from
+	 * the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the already fetched tuples are all part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/* The tuple isn't part of the current batch so we need to carry
+				 * it over into the next set of tuples we transfer out of the full
+				 * sort tuplesort into the presorted prefix tuplesort. We don't
+				 * actually have to do anything special to save the tuple since
+				 * we've already loaded it into the node->transfer_tuple slot, and,
+				 * even though that slot points to memory inside the full sort
+				 * tuplesort, we can't reset that tuplesort anyway until we've
+				 * fully transferred out of its tuples, so this reference is safe.
+				 * We do need to reset the group pivot tuple though since we've
+				 * finished the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/* Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
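+
+/*
+ * With the defaults above we accumulate 32 tuples before checking prefix
+ * key group membership at all, and, if we then exceed 64 accumulated tuples
+ * without finding a group boundary, we assume we've encountered a large
+ * group and switch to presorted prefix mode.
+ */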
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		The implemented algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting
+ *		    solely on the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
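+ *
+ *		Internally the node cycles through these execution states:
+ *		INCSORT_LOADFULLSORT and INCSORT_READFULLSORT implement the first
+ *		mode, while INCSORT_LOADPREFIXSORT and INCSORT_READPREFIXSORT
+ *		implement the second.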
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the full
+			 * sort tuplesort. The first call to switchToPresortedPrefixMode()
+			 * pulled the one of those groups out, and we've returned those
+			 * pulled one of those groups out, and we've returned those
+			 * tuples to the parent node, but if tuples remain in that
+			 * tuplesort (i.e., n_fullsort_remaining > 0) at this point we
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If the sort is bounded, calculate the number of tuples remaining
+		 * and configure both the bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
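+
+		/*
+		 * As an illustration: with bound = 100 and bound_Done = 90, only 10
+		 * more tuples can ever be needed, so we both bound the tuplesort at
+		 * 10 and cap the minimum group size at 10 rather than accumulating a
+		 * full DEFAULT_MIN_GROUP_SIZE group.
+		 */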
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, don't
+				 * bother checking for inclusion in the current prefix key
+				 * group, since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sort "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've already
+		 * established a current group pivot tuple (but it wasn't carried over;
+		 * it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to the parent node. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only keep the
+	 * current prefix key group's tuples in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support randomAccess, so whenever we've
+	 * sorted we must forget the previous results: release both tuplesorts,
+	 * re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3432bb921d..5477d9ad08 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4874,6 +4908,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..9801f8fa2d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -833,10 +833,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -846,6 +844,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3771,6 +3787,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..c2847e8d3f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2117,12 +2117,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2131,6 +2132,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2765,6 +2792,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..adca8322aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..f6ec405d58 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ *	  Determines the cost of sorting a relation, including the
+ *	  cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines and returns the cost of sorting a relation incrementally,
+ *	  when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
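+	/*
+	 * With no presorted keys this would simply be a full sort; callers are
+	 * expected to use cost_sort() for that case instead.
+	 */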
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys
+	 * are equal.  Incremental sort is sensitive to distribution of tuples
+	 * to the groups, where we're relying on quite rough assumptions.  Thus,
+	 * we're pessimistic about incremental sort performance and increase
+	 * its average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
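+
+	/*
+	 * For example, 1000 input tuples estimated to fall into 10 prefix key
+	 * groups give group_tuples = 100, so we cost a single tuplesort of 150
+	 * tuples (applying the 1.5 pessimism factor above) and then charge its
+	 * startup and run cost once per group below.
+	 */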
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..e88a87e189 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
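+ *
+ *    Unlike pathkeys_contained_in, this is useful even on failure: *n_common
+ *    is always set, so callers can tell whether an incremental sort on the
+ *    common prefix is possible.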
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use the incremental sort.
+	 * we can use an incremental sort on them.
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f0..fecf92af9d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1991,6 +2004,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5079,17 +5118,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5670,9 +5716,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5686,6 +5735,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6032,6 +6112,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6766,6 +6882,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f086fb..40f0aaa19a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 566ee96da8..c0370b2c70 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -652,6 +652,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 48b62a55de..42a2370071 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5d66fc2177..fc661c64c1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2742,6 +2742,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2178e1cf5e..bb6221b991 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -940,6 +940,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index ab55e69975..79e4f88f99 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is as small as possible.  However, we don't consider array sizes
+ * less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the amount of
+								   on-disk space, false when it's in-memory
+								   space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
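+/*
+ * tuplesort_used_bound
+ *
+ *	Report whether this sort actually made use of the top-N bounded-heap
+ *	optimization (see tuplesort_set_bound).
+ */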
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data in main memory.
+	 * This is why we consider space used on disk more important for tracking
+	 * resource usage than space used in memory.  Note that the amount of
+	 * space occupied by a set of tuples on disk might be less than the
+	 * amount of space occupied by the same tuples in memory, thanks to the
+	 * more compact on-disk representation.
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Release all the data in the tuplesort, but keep
+ *	the meta-information.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
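+ *
+ *	A sketch of the intended usage pattern for batch sorting (using the
+ *	existing tuplesort API):
+ *
+ *		while (there are more batches)
+ *		{
+ *			... tuplesort_puttupleslot() for each tuple in the batch ...
+ *			tuplesort_performsort(state);
+ *			... tuplesort_gettupleslot() until the batch is drained ...
+ *			tuplesort_reset(state);
+ *		}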
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
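+	/*
+	 * The merge phase may have shrunk the memtuples array (mergeruns
+	 * resizes it to the number of input tapes), so restore the standard
+	 * initial allocation before starting the next batch.
+	 */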
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index c119fdf4fa..3e48593543 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44f76082e9..11247559b6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1973,6 +1973,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
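+ *	 For example, when sorting by (a, b) over input already sorted by (a),
+ *	 a is the single presorted key.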
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo		flinfo;		/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber	attno;		/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2001,6 +2015,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
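+/*
+ * Execution status of an incremental sort node: tuples are either being
+ * loaded into one of the two tuplesorts or being read back out of one.
+ * Roughly, the "full sort" tuplesort sorts small groups on all keys,
+ * while the "prefix sort" tuplesort sorts only the suffix (non-presorted)
+ * keys within a single large group of equal presorted keys.
+ */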
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* have we fetched all tuples from the
+								   outer node? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
+	int64		prefixsort_group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59b0d..f72336e84a 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..aab2fda7dc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1618,6 +1618,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..bbf0739411 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -770,6 +770,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b9d7a77e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..1470d15c78 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137d1d..e7a40cec3f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -183,6 +183,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d774bc1152..cebeef2c60 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fc0f14122b..8a17275846 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 68ac56acdb..f4949b400d 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
#176Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#175)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Sep 27, 2019 at 08:31:30PM -0400, James Coleman wrote:

On Fri, Sep 13, 2019 at 10:51 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:

On 2019-Jul-30, Tomas Vondra wrote:

I've decided to do a couple of experiments, trying to make my mind about
which modified places matter to diffrent queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

[...]

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

OK, so we have that now. I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested. Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful? I'm not
sure what the next steps are for this patch.

Yes. I think the spreadsheet can help us with answering two things:

1) places actually affecting the plan (all but three do)

2) redundant places (there are some cases where two GUCs produce the
same plan in the end)

To expand on this further, (1) should probably help us write test cases.

Additionally, one big thing we still need that's somewhat external to
the patch is a good way to benchmark: a set of queries that we believe
are representative enough to be good benchmarks.

I'd really appreciate some input from you all on that particular
question; I feel like it's in some sense the biggest barrier to
getting the patch merged, but also the part where long experience in
the community/exposure to other use cases will probably be quite
valuable.

Hmmm. I don't think anyone will hand us a set of representative queries,
so I think we have two options:

1) Generate synthetic queries covering a wide range of cases (both when
incremental sort is expected to help and not). I think the script I've
used to determine which places do matter can be the starting point.

2) Look at some established benchmarks and see if some of the queries
could benefit from the incremental sort (possibly with some changes to
indexes in the usual schema). I plan to look at TPC-H / TPC-DS, but I
wonder if some OLTP benchmarks would be relevant too.

Of course, this does assume the query set makes sense and is somewhat
realistic, but I've tried to construct queries where that is true. We
may extend it over time.
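
For instance (just a sketch, assuming the regress tenk1 table and its
index on hundred - not taken from the actual script), the synthetic set
would contain pairs like:

  -- incremental sort expected to help: presorted prefix plus a LIMIT
  select * from tenk1 order by hundred, twenty limit 10;

  -- incremental sort not expected to help: the whole table gets sorted
  select * from tenk1 order by hundred, twenty;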

I think we've agreed to add incremental sort paths different places in
separate patches, to make review easier. So this may be a useful way to
decide which places to address first. I'd probably do it in this order:

- create_ordered_paths
- create_ordered_paths (parallel part)
- add_paths_to_grouping_rel
- ... not sure ...

but that's just a proposal. It'd give us most of the benefits, I think,
and we could also focus on the rest of the patch.

Certainly the first two seem like the most obviously necessary base
cases. Supporting GROUP BY also seems like a pretty standard case, so
at first glance this looks like a reasonable course to me.

OK.

I'm going to start breaking up the patches in this thread into a
series in support of that. Since I've started a new thread with the
add_partial_path change, I'll include that patch here as part of this
series also. Do you think it's worth moving the tuplesort changes into
a standalone patch in the series also?

Probably. I'd do that at least for the review.

Attached is a rebased v31 now broken into the following:

- 001-consider-startup-cost-in-add-partial-path_v1.patch: From the
other thread (Tomas's patch unmodified)
- 002-incremental-sort_v31.patch: Updated base incremental sort patch

Besides rebasing, I've changed the enable_incrementalsort GUC to
prevent generating paths entirely rather than being cost-based, since
incremental sort is never absolutely necessary in the way regular sort
is.
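
For illustration, the guard looks roughly like this (a sketch, not the
exact patch hunk):

    if (enable_incrementalsort && presorted_keys > 0)
        add_path(rel, (Path *)
                 create_incremental_sort_path(root, rel, subpath,
                                              pathkeys, presorted_keys,
                                              limit_tuples));

so with the GUC off the path is simply never generated, rather than
generated with a cost penalty.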

OK, makes sense.

I'm hoping to add 003 soon with the initial parallel parts, but I'm
about out of time right now and wanted to get something out, so
sending this without that.

Side question: for the patch tester do I have to attach each part of
the series each time even if nothing's changed in several of them? And
does the vN number at the end need to stay the same for all of them?
My attachments to this email don't follow that... Also, since this
email changes patch naming, do I need to do anything to clear out the
old ones? (I suppose if not, then that would imply an answer to the
first question also.)

Please always send the whole patch series. Firstly, that's the only way
the cfbot can apply and test the patches (it can't collect patches
scattered in different messages in the thread). Secondly, it's really
annoying for the reviewers to try to pick the matching bits.

Also, it's a good idea to always mark all parts with the same version
info, not v1 for one part and v31 for another. I'd simply do
something like

git format-patch --suffix=-YYYYMMDD.patch master

or something like that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#177Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#173)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Sep 27, 2019 at 01:50:30PM -0400, James Coleman wrote:

On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The "patched" column means all developer GUCs disabled, so it's expected
to produce the same plan as master (and it is). And then there's one
column for each developer GUC. If the column is just TRUE it means the
GUC does not affect any of the synthetic queries. There are 4 of them:

- devel_add_paths_to_grouping_rel_parallel
- devel_create_partial_grouping_paths
- devel_gather_grouping_paths
- devel_standard_join_search

The places controlled by those GUCs are either useless, or the query
affected by them is not included in the list of queries.

I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these gucs set in the
test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;

So I think we can reduce the number of unused gucs to 3.

OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#178Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#177)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:

OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.

Any progress on that?

Please note that the latest patch does not apply anymore, so a rebase
is needed. I am switching the patch as waiting on author for now.
--
Michael

#179Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#178)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Nov 29, 2019 at 03:01:46PM +0900, Michael Paquier wrote:

On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:

OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.

Any progress on that?

Please note that the latest patch does not apply anymore, so a rebase
is needed. I am switching the patch as waiting on author for now.
--

Ah, thanks for reminding me. I've added a couple more queries with two
joins (previously there were only queries with a single join; I hadn't
expected another join to make such a difference, but it seems I was wrong).

So yes, there seem to be 6 different GUCs / places where considering
incremental sort makes a difference (the numbers say how many of the
4960 tested combinations were affected)

- create_ordered_paths_parallel (50)
- create_partial_grouping_paths_2 (228)
- standard_join_search (94)
- add_paths_to_grouping_rel (2148)
- set_rel_pathlist (156)
- apply_scanjoin_target_to_paths (286)

Clearly some of the places are more important than others, plus there
are some overlaps (two GUCs producing the same plan, etc.).

Plus there are four GUCs that did not affect any queries at all:

- create_partial_grouping_paths
- gather_grouping_paths
- create_ordered_paths
- add_paths_to_grouping_rel_parallel

Anyway, this might serve as a way to prioritize the effort. All the
test changes are in the original repo at

https://github.com/tvondra/incremental-sort-tests-2

and I'm also attaching the rebased patches - the changes were pretty
minor, hopefully that helps others. All the patches with dev GUCs are in

https://github.com/tvondra/postgres/tree/incremental-sort-20191129)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Consider-low-startup-cost-when-adding-partial-pa-v32.patchtext/plain; charset=us-asciiDownload
From 14beb5a9f3282d452844cffa86bafb59df831343 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 29 Nov 2019 19:41:26 +0100
Subject: [PATCH 01/13] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with
the comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher-cost
plan ends up being chosen: a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 60c93ee7c5..f24ba587e5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.21.0

0002-Implement-incremental-sort-v32.patchtext/plain; charset=us-asciiDownload
From 3f631de90190efa582085dcd84f1f1f395d10beb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 29 Nov 2019 19:47:28 +0100
Subject: [PATCH 02/13] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   13 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1107 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  194 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  129 +-
 src/backend/optimizer/plan/planner.c          |   71 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   30 +
 src/include/nodes/execnodes.h                 |   68 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1160 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   78 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3505 insertions(+), 115 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4ec13f3311..ba764671bb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4467,6 +4467,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..8a3bf8a4e5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..e06258e134 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -63,6 +63,7 @@ OBJS = \
 	nodeSeqscan.o \
 	nodeSetOp.o \
 	nodeSort.o \
+	nodeIncrementalSort.o \
 	nodeSubplan.o \
 	nodeSubqueryscan.o \
 	nodeTableFuncscan.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 779d3dccea..0d0fec82f1 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index b256642665..2aedc2b1af 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7a6e285149..a5e928f8f1 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react properly to
+		 * changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ *	  nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them all together, we get the
+ *		following result, which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail keys
+	 * are more likely to change. Therefore we do our comparison starting from
+	 * the last pre-sorted column to optimize for early detection of
+	 * inequality and minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
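+		/*
+		 * Check whether the tuple we're about to read is the last one we
+		 * still need to transfer out of the full sort tuplesort.
+		 */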
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/* The tuple isn't part of the current batch so we need to carry
+				 * it over into the next set of tuples we transfer out of the full
+				 * sort tuplesort into the presorted prefix tuplesort. We don't
+				 * actually have to do anything special to save the tuple since
+				 * we've already loaded it into the node->transfer_tuple slot, and,
+				 * even though that slot points to memory inside the full sort
+				 * tuplesort, we can't reset that tuplesort anyway until we've
+				 * fully transferred out of its tuples, so this reference is safe.
+				 * We do need to reset the group pivot tuple though since we've
+				 * finished the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/* Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that outer subtree returns tuple presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the full
+			 * sort tuplesort. The first call to switchToPresortedPrefixMode()
+			 * pulled one of those groups out, and we've returned those
+			 * tuples to the inner node, but if we have tuples remaining in
+			 * that tuplesort (i.e., n_fullsort_remaining > 0) at this point we
+			 * need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/* Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then don't
+				 * bother with checking for inclusion in the current prefix group
+				 * since a large number of very tiny sorts is inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n heapsort mode
+				 * then we will only be able to retrieve currentBound tuples (since the
+				 * tuplesort will have only retained the top-n tuples). This is safe even
+				 * though we haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" would have sorted "below" the retained
+				 * ones, and we're already contractually guaranteed to not need any more
+				 * than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've already
+		 * established a current group pivot tuple (though it wasn't carried over;
+		 * it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to the inner plan nodes. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
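+
+/*
+ * Illustrative walk-through of the prefix-sort cycle above (hypothetical
+ * data; ORDER BY a, b over input already sorted by a): tuples with a = 1
+ * accumulate in the prefix sort state; when the first a = 2 tuple arrives
+ * it is parked in the group_pivot slot, the a = 1 group is sorted and
+ * returned (INCSORT_READPREFIXSORT), and the cycle then repeats for a = 2.
+ */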
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only buffer the
+	 * current sort group in the tuplesort states.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Make standalone slots to store the group pivot tuple and the tuple
+	 * carried over between the sort states.
+	 */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If the subnode is to be rescanned we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Since incremental sort
+	 * doesn't support rewinding its output, we always re-sort rather than
+	 * rescanning previously sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 2f267e4bb6..5c3aef32c4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4875,6 +4909,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..9801f8fa2d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -833,10 +833,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -846,6 +844,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3771,6 +3787,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..c2847e8d3f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2117,12 +2117,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2131,6 +2132,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2765,6 +2792,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..adca8322aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..f6ec405d58 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+			linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group whose presorted keys
+	 * are all equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and our estimates of that distribution are
+	 * quite rough.  We are therefore pessimistic about incremental sort
+	 * performance and inflate the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all remaining tuples is the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of the input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead of its own.  First, it has to
+	 * detect the sort groups, which costs roughly one extra copy and
+	 * comparison per tuple.  Second, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
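+
+/*
+ * Illustrative arithmetic for the model above (assumed figures): with
+ * input_tuples = 10000 and an estimated input_groups = 100, each group
+ * holds about 100 tuples but is costed as a 150-tuple tuplesort.  Startup
+ * cost covers the input startup plus sorting the first group; the other
+ * 99 groups are charged to run cost, together with one extra copy and
+ * comparison per tuple and 2 * cpu_tuple_cost per group for group
+ * detection and tuplesort reset.
+ */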
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..e88a87e189 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
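+
+/*
+ * For example (hypothetical pathkeys): with keys1 = (a, b) and
+ * keys2 = (a, b, c), this returns true and sets *n_common = 2; with
+ * keys1 = (a, b, c) and keys2 = (a, b), it returns false, also with
+ * *n_common = 2.
+ */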
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of pathkeys in common, or 0 if there are none.  Any
+	 * leading common pathkeys are potentially useful for ordering because
+	 * an incremental sort can exploit them.
+	 */
+	return n_common_pathkeys;
 }
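+
+/*
+ * For instance (hypothetical schema): for a query with ORDER BY a, b and a
+ * path sorted only by (a), this now returns 1 rather than 0, since an
+ * incremental sort can supply the remaining key.
+ */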
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index aee81bd755..39ae858894 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1991,6 +2004,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5082,17 +5121,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5673,9 +5719,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5689,6 +5738,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6035,6 +6115,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6769,6 +6885,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a0..a153e77247 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4923,8 +4923,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4963,29 +4963,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 566ee96da8..c0370b2c70 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -652,6 +652,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 48b62a55de..42a2370071 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f24ba587e5..278189a4a9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
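+
+/*
+ * Callers are expected to compute presorted_keys with
+ * pathkeys_common_contained_in(); see create_ordered_paths() for an
+ * example of the intended usage.
+ */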
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5fccc9683e..4e7959edb7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -947,6 +947,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7947d2bca0..cb1aacb207 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the overhead of
+ * allocation is as small as possible.  However, we don't consider array
+ * sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
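+
+/*
+ * (Illustration, assuming a common 64-bit build: sizeof(SortTuple) is
+ * 24 bytes, making the threshold term roughly 342, so the macro resolves
+ * to 1024.)
+ */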
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sort batches, either in-memory or
+								 * on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								 * space, false when it's for in-memory
+								 * space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuplesort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context that survives tuplesort_reset.  This context holds
+	 * data that is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when the data fails to fit in main
+	 * memory.  This is why we consider space used on disk more important
+	 * than space used in memory when tracking resource usage.  Note that
+	 * the amount of space a set of tuples occupies on disk might be less
+	 * than the amount the same tuples occupy in memory, due to the more
+	 * compact on-disk representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
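+
+/*
+ * For example (assumed numbers): if one batch spilled 1MB to disk while a
+ * later batch used 4MB of memory, the reported maximum remains the 1MB
+ * disk figure, since any spill to disk is treated as the more significant
+ * event.
+ */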
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Release all of its sort data, but keep the
+ *	meta-information.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple similar batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
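+
+/*
+ * Sketch of the intended usage, one iteration per batch (assumes the
+ * caller already created the Tuplesortstate):
+ *
+ *		tuplesort_puttupleslot(state, slot);	-- load one batch
+ *		tuplesort_performsort(state);
+ *		while (tuplesort_gettupleslot(state, true, false, slot, NULL))
+ *			-- consume the sorted batch
+ *		tuplesort_reset(state);					-- ready for the next batch
+ */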
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index c119fdf4fa..3e48593543 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6eb647290b..f4d1104d4d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1972,6 +1972,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on a leading prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2000,6 +2014,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT
+} IncrementalSortExecutionStatus;
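+
+/*
+ * Rough lifecycle sketch (illustrative; see nodeIncrementalSort.c):
+ * LOADFULLSORT accumulates input tuples; from there the node either drains
+ * that sort directly (READFULLSORT) or transfers tuples sharing a common
+ * prefix into the prefix sort (LOADPREFIXSORT) and drains it one group at
+ * a time (READPREFIXSORT).
+ */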
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* has fetching tuples from the outer node
+								 * finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of sort batches done by the full sort */
+	int64		prefixsort_group_count;	/* number of sort batches done by the prefix sort */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59b0d..f72336e84a 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..aab2fda7dc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1618,6 +1618,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..bbf0739411 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -770,6 +770,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b9d7a77e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..1470d15c78 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c6c34630c2..a3cd817cb5 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d774bc1152..cebeef2c60 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve by inserting extra rows or by adding a GUC that would somehow
+-- forcing the time of plan we expect.
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d33a4e143d..195c570efd 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5682..7d72b8979e 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.0

#180Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#179)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#181James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#180)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

James

#182Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#181)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
won't happen by the end of this CF, hence the move to 2020-03.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#183James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#182)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
won't happen by the end of this CF, hence the move to 2020-03.

Oh, yeah, I probably didn't word that reply well -- I just wanted to
add some additional detail to what you had already said.

Thanks for your work on this!

James

#184David Steele
david@pgmasters.net
In reply to: James Coleman (#183)
Re: [PATCH] Incremental sort

James and Tomas,

On 1/21/20 10:03 AM, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
won't happen by the end of this CF, hence the move to 2020-03.

Oh, yeah, I probably didn't word that reply well -- I just wanted to
add some additional detail to what you had already said.

Thanks for your work on this!

It doesn't look like there has been much movement on this patch for the
last few CFs. Are one or both of you planning to work on this for v13? Or
should we mark it for v14 and/or move it to the next CF?

Regards,
--
-David
david@pgmasters.net

#185Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Steele (#184)
Re: [PATCH] Incremental sort

On Tue, Mar 03, 2020 at 12:17:22PM -0500, David Steele wrote:

James and Tomas,

On 1/21/20 10:03 AM, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
won't happen by the end of this CF, hence the move to 2020-03.

Oh, yeah, I probably didn't word that reply well -- I just wanted to
add some additional detail to what you had already said.

Thanks for your work on this!

It doesn't look like there has been much movement on this patch for the
last few CFs. Are one or both of you planning to work on this for
v13? Or should we mark it for v14 and/or move it to the next CF?

I'm currently working on it; I plan to submit a new patch version
shortly, hopefully by the end of this week.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#186James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#185)
Re: [PATCH] Incremental sort

On Tue, Mar 3, 2020 at 1:43 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Tue, Mar 03, 2020 at 12:17:22PM -0500, David Steele wrote:

James and Tomas,

On 1/21/20 10:03 AM, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:

On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

This patch has been marked as WoA since end of November, and there has
been no discussion/reviews since then :-( Based on off-list discussion
with James I don't think that's going to change in this CF, so I'll move
it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this, just my other responsibilities at work have left
me little time to work on the most important remaining part of this
(the planner parts) since that requires a fair amount of focus and
time.

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
won't happen by the end of this CF, hence the move to 2020-03.

Oh, yeah, I probably didn't word that reply well -- I just wanted to
add some additional detail to what you had already said.

Thanks for your work on this!

It doesn't look like there has been much movement on this patch for the
last few CFs. Are one or both of you planning to work on this for
v13? Or should we mark it for v14 and/or move it to the next CF?

I'm currently working on it; I plan to submit a new patch version
shortly, hopefully by the end of this week.

Tomas, thanks much for working on this.

I haven't had a lot of time to dedicate to this, but I do hope soon (late
this week or next) to push improvements to the EXPLAIN ANALYZE output, in
addition to the planner stuff I believe Tomas is working on.

James

#187James Coleman
jtc331@gmail.com
In reply to: James Coleman (#181)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

I'm looking at this now, and realized that at least for parallel plans the
current patch tracks the tuplesort instrumentation whether or not an
EXPLAIN ANALYZE is in progress.

Is this fairly standard for executor nodes? Or is it expected to condition
some of this tracking based on whether or not an ANALYZE is running?

I found EXEC_FLAG_EXPLAIN_ONLY but no parallel for ANALYZE. Similarly, the
InstrumentOption bit flags on the executor state seem to indicate whether
specific ANALYZE options should be enabled, but I haven't yet seen anything
conditioned solely on whether an ANALYZE is in flight. Could someone point
me in the right direction if this is expected?

Thanks,
James

#188Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#187)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

I'm looking at this now, and realized that at least for parallel plans the
current patch tracks the tuplesort instrumentation whether or not an
EXPLAIN ANALYZE is in progress.

Is this fairly standard for executor nodes? Or is it expected to condition
some of this tracking based on whether or not an ANALYZE is running?

No, it's entirely not standard. Maybe you could make an argument that
it's too cheap to bother making it conditional, but without a convincing
argument for that, it needs to be conditional.

regards, tom lane

#189James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#188)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Mar 5, 2020 at 5:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

James Coleman <jtc331@gmail.com> writes:

I'm looking at this now, and realized that at least for parallel plans the
current patch tracks the tuplesort instrumentation whether or not an
EXPLAIN ANALYZE is in progress.

Is this fairly standard for executor nodes? Or is it expected to condition
some of this tracking based on whether or not an ANALYZE is running?

No, it's entirely not standard. Maybe you could make an argument that
it's too cheap to bother making it conditional, but without a convincing
argument for that, it needs to be conditional.

That's what I figured, but as I mentioned I'm having trouble figuring out
how the fact that an analyze is in flight is determined. I assume it's
something that lives on the EState or similar, but I'm not seeing anything
obvious.

Thanks
James

#190Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#189)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

That's what I figured, but as I mentioned I'm having trouble figuring out
how the fact that an analyze is in flight is determined. I assume it's
something that lives on the EState or similar, but I'm not seeing anything
obvious.

AFAIR, it's just whether or not the current planstate node has an
instrumentation struct attached.

regards, tom lane

#191James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#190)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Mar 5, 2020 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

James Coleman <jtc331@gmail.com> writes:

That's what I figured, but as I mentioned I'm having trouble figuring out
how the fact that an analyze is in flight is determined. I assume it's
something that lives on the EState or similar, but I'm not seeing anything
obvious.

AFAIR, it's just whether or not the current planstate node has an
instrumentation struct attached.

Oh, that's easy. Thanks for pointing that out. I'll be attaching a new
patch soon incorporating that check.

James

#192James Coleman
jtc331@gmail.com
In reply to: James Coleman (#181)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage and number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

James

I'm attaching a rebased patch revision plus a new commit that reworks the
EXPLAIN output. I left that patch separate for now so that it's easy to
see the difference, and so that, while Tomas is working on things in
parallel, I don't unnecessarily cause merge conflicts; the next patch
revision (assuming the EXPLAIN change looks good) can just incorporate it
into the base patch.

Here's what I've changed:

- The stats necessary for ANALYZE are now only kept if the PlanState has a
non-null instrument field (thanks to Tom for pointing out this as the
correct way to check that ANALYZE is in flight); see the sketch after this
list. I did leave lines like
`node->incsort_info.fullsortGroupInfo.groupCount++;` unguarded by that `if`
since it's practically zero overhead (about the same cost as checking the
condition), but if anyone disagrees, I'm happy to change it. Additionally
those lines (if ANALYZE is not in flight) technically operate on variables
that haven't explicitly been initialized in the Init function; please tell
me if that's actually an issue given they are counters and we won't be
using them in that case.
- A good bit of cleanup on how parallel workers are output (I believe there
was some duplicative group opening and also text output inconsistent with
other multi-worker EXPLAIN nodes). I haven't had a chance to test this yet,
though, so there could be bugs.
- I left an XXX in the patch to note a place I wanted extra eyes. The
original patch ignored workers if the tuplesort for that worker still had
an in-progress status, but from what I can tell that doesn't make a lot of
sense given that we re-use the same tuplesort multiple times. So a parallel
worker (I think) could have returned from the first batch but still be in
progress on the 2nd batch, and we wouldn't want to ignore that worker. As
a replacement I'm now checking that at least one of the fullsort and
prefixsort group count stats is greater than 0 (to imply we've sorted at
least one batch).
- I also left a TODO wondering if we should break out the instrumentation
into a separate function; it seems like a decent-sized chunk of cleanly
extractable code. I suppose that's always a bit of personal preference, so
anyone who wants to weigh in gets a vote :)
- The previous implementation assumed the most recent tuplesort usage had
the correct information for memory/disk usage and sort implementation, but
again, since we re-use, that doesn't make a lot of sense. Instead I now
output all sort methods used as well as maximum and average disk and memory
usage.
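
To make the first point concrete, the guard is shaped roughly like this (a
simplified sketch in the spirit of the patch, not its exact code;
instrumentSortedGroup() is a hypothetical helper standing in for whatever
ends up accumulating the per-group stats):

	/*
	 * Sketch: gather tuplesort stats only when EXPLAIN ANALYZE attached
	 * instrumentation to this node; the bare counter increment is cheap
	 * enough to stay unconditional.
	 */
	node->incsort_info.fullsortGroupInfo.groupCount++;
	if (node->ss.ps.instrument != NULL)
	{
		TuplesortInstrumentation stats;

		tuplesort_get_stats(node->fullsort_state, &stats);
		/* hypothetical helper; folds stats into fullsortGroupInfo */
		instrumentSortedGroup(&node->incsort_info.fullsortGroupInfo, &stats);
	}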

Here's example output:
-> Incremental Sort
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
   -> Index Scan using idx_t_a...

You'd get an additional "Presorted Groups: ..." line, paralleling the
"Full-sort Groups" line, whenever any presorted groups are present.
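
For instance, a node that sorted some groups in each mode might print
something like this (illustrative values, not from a real run):

-> Incremental Sort
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
   Presorted Groups: 2 (Methods: quicksort) Memory: 42kB (avg), 43kB (max)
   -> Index Scan using idx_t_a...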

I haven't yet run pgindent over this, and I didn't want to modify the base
patch given that other work on it is in flight.

James

Attachments:

v33-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v33-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 269df4be255b3b117e1ff27ba9b340594aacaf2c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v33 1/3] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, and a limit applied on top of that would
normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 

base-commit: b9c3de62cbc9c6993ceac0de99985cf051e91c88
-- 
2.17.1

v33-0003-Rework-EXPLAIN-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v33-0003-Rework-EXPLAIN-for-incremental-sort.patchDownload
From 22f655e7989b77b593ad42561dd533af059c4d67 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 7 Mar 2020 17:09:39 -0500
Subject: [PATCH v33 3/3] Rework EXPLAIN for incremental sort

---
 src/backend/commands/explain.c             | 253 ++++++++++-----------
 src/backend/executor/nodeIncrementalSort.c |  98 ++++++--
 src/include/nodes/execnodes.h              |  29 ++-
 3 files changed, 221 insertions(+), 159 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 14aedec919..8262c54e6a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2701,80 +2701,114 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
-/*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
- */
+
 static void
-show_incremental_sort_info(IncrementalSortState *incrsortstate,
-						   ExplainState *es)
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+						   const char *groupLabel, ExplainState *es)
 {
-	if (es->analyze && incrsortstate->sort_Done &&
-		incrsortstate->fullsort_state != NULL)
+	const char *sortMethodName;
+	const char *spaceTypeName;
+	ListCell *methodCell;
+	int methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
-		/* TODO: maybe show average, min, max sort group size? */
-
-		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
-		TuplesortInstrumentation fullsort_stats;
-		const char *fullsort_sortMethod;
-		const char *fullsort_spaceType;
-		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
-		TuplesortInstrumentation prefixsort_stats;
-		const char *prefixsort_sortMethod;
-		const char *prefixsort_spaceType;
-
-		tuplesort_get_stats(fullsort_state, &fullsort_stats);
-		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
-		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
-		if (prefixsort_state != NULL)
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
 		{
-			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
-			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
-			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
 		}
+		appendStringInfo(es->str, ")");
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
-							 fullsort_sortMethod, fullsort_spaceType,
-							 fullsort_stats.spaceUsed);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
-								 prefixsort_sortMethod, prefixsort_spaceType,
-								 prefixsort_stats.spaceUsed);
-			else
-				appendStringInfo(es->str, "\n");
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
-							 incrsortstate->fullsort_group_count);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %ld\n",
-							 incrsortstate->prefixsort_group_count);
-			else
-				appendStringInfo(es->str, "\n");
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
 		}
-		else
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
-			/* TODO */
-			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-			ExplainPropertyInteger("Full Sort Space Used", "kB",
-					fullsort_stats.spaceUsed, es);
-			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-			ExplainPropertyInteger("Full Sort Groups", NULL,
-								   incrsortstate->fullsort_group_count, es);
-
-			if (prefixsort_state != NULL)
-			{
-				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
-						prefixsort_stats.spaceUsed, es);
-				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-				ExplainPropertyInteger("Prefix Sort Groups", NULL,
-									   incrsortstate->prefixsort_group_count, es);
-			}
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
 		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
 	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
@@ -2785,79 +2819,36 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		{
 			IncrementalSortInfo *incsort_info =
 				&incrsortstate->shared_info->sinfo[n];
-			TuplesortInstrumentation *fullsort_instrument;
-			const char *fullsort_sortMethod;
-			const char *fullsort_spaceType;
-			long		fullsort_spaceUsed;
-			int64		fullsort_group_count;
-			TuplesortInstrumentation *prefixsort_instrument;
-			const char *prefixsort_sortMethod;
-			const char *prefixsort_spaceType;
-			long		prefixsort_spaceUsed;
-			int64		prefixsort_group_count;
-
-			fullsort_instrument = &incsort_info->fullsort_instrument;
-			fullsort_group_count = incsort_info->fullsort_group_count;
-
-			prefixsort_instrument = &incsort_info->prefixsort_instrument;
-			prefixsort_group_count = incsort_info->prefixsort_group_count;
-
-			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
-				continue;		/* ignore any unfilled slots */
-
-			fullsort_sortMethod = tuplesort_method_name(
-					fullsort_instrument->sortMethod);
-			fullsort_spaceType = tuplesort_space_type_name(
-					fullsort_instrument->spaceType);
-			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and skipped the slot if the condition was true (with the
+			 * comment "ignore any unfilled slots").
+			 * I'm not convinced that makes sense: the same sort instrument
+			 * can be used multiple times, so whether its most recent use is
+			 * still in progress doesn't seem relevant.
+			 * Instead we now check the group count of each group info; if
+			 * both are 0, we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+					prefixsortGroupInfo->groupCount == 0)
+				continue;
 
-			if (prefixsort_instrument)
+			if (!opened_group)
 			{
-				prefixsort_sortMethod = tuplesort_method_name(
-						prefixsort_instrument->sortMethod);
-				prefixsort_spaceType = tuplesort_space_type_name(
-						prefixsort_instrument->spaceType);
-				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
 			}
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfo(es->str,
-								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
-								 n, fullsort_sortMethod, fullsort_spaceType,
-								 fullsort_spaceUsed, fullsort_group_count);
-				if (prefixsort_instrument)
-					appendStringInfo(es->str,
-									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
-									 prefixsort_sortMethod, prefixsort_spaceType,
-									 prefixsort_spaceUsed, prefixsort_group_count);
-				else
-					appendStringInfo(es->str, "\n");
-			}
-			else
-			{
-				if (!opened_group)
-				{
-					ExplainOpenGroup("Workers", "Workers", false, es);
-					opened_group = true;
-				}
-				ExplainOpenGroup("Worker", NULL, true, es);
-				ExplainPropertyInteger("Worker Number", NULL, n, es);
-				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
-				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
-				if (prefixsort_instrument)
-				{
-					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
-					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
-				}
-				ExplainCloseGroup("Worker", NULL, true, es);
-			}
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 		}
+
 		if (opened_group)
 			ExplainCloseGroup("Workers", "Workers", false, es);
 	}
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index c3b903e568..bc8b5a798c 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -355,7 +355,7 @@ switchToPresortedPrefixMode(IncrementalSortState *node)
 		 */
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+		node->incsort_info.prefixsortGroupInfo.groupCount++;
 
 		if (node->bounded)
 		{
@@ -602,7 +602,7 @@ ExecIncrementalSort(PlanState *pstate)
 
 				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+				node->incsort_info.fullsortGroupInfo.groupCount++;
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -673,7 +673,7 @@ ExecIncrementalSort(PlanState *pstate)
 					 */
 					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 					tuplesort_performsort(fullsort_state);
-					node->fullsort_group_count++;
+					node->incsort_info.fullsortGroupInfo.groupCount++;
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
 					break;
@@ -705,7 +705,7 @@ ExecIncrementalSort(PlanState *pstate)
 				 */
 				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+				node->incsort_info.fullsortGroupInfo.groupCount++;
 
 				/*
 				 * If the full sort tuplesort happened to switch into top-n heapsort mode
@@ -801,7 +801,7 @@ ExecIncrementalSort(PlanState *pstate)
 		/* Perform the sort and return the tuples to the inner plan nodes. */
 		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+		node->incsort_info.prefixsortGroupInfo.groupCount++;
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
 
@@ -828,23 +828,67 @@ ExecIncrementalSort(PlanState *pstate)
 	 */
 	node->sort_Done = true;
 
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
+	/* TODO: break this out into a function? */
+	if (pstate->instrument != NULL)
 	{
-		IncrementalSortInfo *incsort_info =
-			&node->shared_info->sinfo[ParallelWorkerNumber];
+		IncrementalSortGroupInfo *groupInfo;
+		TuplesortInstrumentation	sort_instr;
 
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
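+		/*
+		 * Accumulate this sort run's stats into the group info for the mode:
+		 * track total and maximum space used per space type, and remember
+		 * each distinct sort method used.
+		 */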
+		groupInfo = &node->incsort_info.fullsortGroupInfo;
+		tuplesort_get_stats(fullsort_state, &sort_instr);
+		switch (sort_instr.spaceType)
+		{
+			case SORT_SPACE_TYPE_DISK:
+				groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+				if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+					groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+				break;
+			case SORT_SPACE_TYPE_MEMORY:
+				groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+				if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+					groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+				break;
+		}
 
-		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
-		incsort_info->fullsort_group_count = node->fullsort_group_count;
+		if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+			groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+					sort_instr.sortMethod);
 
 		if (node->prefixsort_state)
 		{
-			tuplesort_get_stats(node->prefixsort_state,
-					&incsort_info->prefixsort_instrument);
-			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+			groupInfo = &node->incsort_info.prefixsortGroupInfo;
+			tuplesort_get_stats(node->prefixsort_state, &sort_instr);
+			switch (sort_instr.spaceType)
+			{
+				case SORT_SPACE_TYPE_DISK:
+					groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+					if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+						groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+					break;
+				case SORT_SPACE_TYPE_MEMORY:
+					groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+					if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+						groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+					break;
+			}
+
+			if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+				groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+						sort_instr.sortMethod);
+		}
+
+		/* Record shared stats if we're a parallel worker. */
+		if (node->shared_info && node->am_worker)
+		{
+			Assert(IsParallelWorker());
+			Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+			memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+					&node->incsort_info, sizeof(IncrementalSortInfo));
 		}
 	}
 
@@ -900,10 +944,28 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	incrsortstate->transfer_tuple = NULL;
 	incrsortstate->n_fullsort_remaining = 0;
 	incrsortstate->bound_Done = 0;
-	incrsortstate->fullsort_group_count = 0;
-	incrsortstate->prefixsort_group_count = 0;
 	incrsortstate->presorted_keys = NULL;
 
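+	/* Sort group stats are only collected when the node is instrumented. */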
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+			&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+			&incrsortstate->incsort_info.prefixsortGroupInfo;
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
 	/*
 	 * Miscellaneous initialization
 	 *
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f905e384a2..0934482123 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2022,18 +2022,26 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
-/* ----------------
- *	 Shared memory container for per-worker incremental sort information
- * ----------------
- */
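+/* ----------------
+ *	 Statistics for a single incremental sort mode (full sort or presorted
+ *	 prefix sort), accumulated across all of that mode's sort groups
+ * ----------------
+ */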
+typedef struct IncrementalSortGroupInfo
+{
+	int64 groupCount;			/* number of groups sorted in this mode */
+	long maxDiskSpaceUsed;		/* max disk space used by any one group */
+	long totalDiskSpaceUsed;	/* disk space summed across all groups */
+	long maxMemorySpaceUsed;	/* max memory used by any one group */
+	long totalMemorySpaceUsed;	/* memory summed across all groups */
+	List *sortMethods;			/* list of TuplesortMethods used */
+} IncrementalSortGroupInfo;
+
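+/* ----------------
+ *	 Per-node incremental sort statistics, covering both sort modes
+ * ----------------
+ */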
 typedef struct IncrementalSortInfo
 {
-	TuplesortInstrumentation	fullsort_instrument;
-	int64						fullsort_group_count;
-	TuplesortInstrumentation	prefixsort_instrument;
-	int64						prefixsort_group_count;
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
 } IncrementalSortInfo;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
 typedef struct SharedIncrementalSortInfo
 {
 	int							num_workers;
@@ -2067,8 +2075,9 @@ typedef struct IncrementalSortState
 	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
 	/* the keys by which the input path is already sorted */
 	PresortedKeyData *presorted_keys;
-	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
-	int64		prefixsort_group_count;	/* number of groups with equal presorted keys */
+
+	IncrementalSortInfo incsort_info;	/* sort group stats for EXPLAIN ANALYZE */
+
 	/* slot for pivot tuple defining values of presorted keys within group */
 	TupleTableSlot *group_pivot;
 	TupleTableSlot *transfer_tuple;
-- 
2.17.1

v33-0002-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 3773b33250d78c39b989898376868af66f0c17ea Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v33 2/3] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting
    solely on the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
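
As an illustration (hypothetical schema, not taken from the included
regression tests), a query such as

    CREATE TABLE t (a int, b int);
    CREATE INDEX ON t (a);
    SELECT * FROM t ORDER BY a, b LIMIT 10;

can use an index scan to produce rows already ordered by (a) and an
incremental sort to extend that ordering to (a, b), sorting one prefix
key group at a time instead of the whole table.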

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   13 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1107 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  194 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  129 +-
 src/backend/optimizer/plan/planner.c          |   71 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   30 +
 src/include/nodes/execnodes.h                 |   68 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1160 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   78 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3505 insertions(+), 115 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..9436d96bc0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..14aedec919 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
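+	/* Also report which of the sort keys the input was already sorted by. */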
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..cba648a95e 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..8051f46a71 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign
+		 * this, it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change.  Therefore we do our comparison
+	 * starting from the last pre-sorted column to optimize for early
+	 * detection of inequality and minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the already-fetched tuples are all part of a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch, so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special
+				 * to save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished
+				 * the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that outer subtree returns tuple presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting
+ *		    solely on the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out,
+			 * and we've returned those tuples to the caller, but if tuples
+			 * remain in that tuplesort (i.e., n_fullsort_remaining > 0) at
+			 * this point we need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the number of tuples remaining if the sort is bounded,
+		 * and configure both bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
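+			/*
+			 * For example, with a remaining bound of 5 we cap the minimal
+			 * group size at 5, since accumulating more tuples than the bound
+			 * requires before the first sort would be wasted work.
+			 */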
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current prefix
+				 * key group, since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								node->bound_Done,
+								Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've already
+		 * established a current group pivot tuple (though it wasn't carried
+		 * over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to our caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done,
+					Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the current
+	 * prefix key group in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Make standalone slots to store a tuple coming from the outer node
+	 * (the group pivot) and to transfer tuples between the two tuplesorts.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Since incremental sort
+	 * doesn't support randomAccess, we always have to re-sort rather than
+	 * rewinding and rescanning the sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
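+	/* a missing DSM entry just means we aren't collecting instrumentation */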
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..d2b9bd95ba 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..d1748d1011 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
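+	/*
+	 * Incremental sort only makes sense when at least one leading pathkey is
+	 * already sorted; a plain sort should be costed with cost_full_sort().
+	 */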
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+			linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, and we're relying on quite rough assumptions
+	 * here.  Thus, we're pessimistic about incremental sort performance and
+	 * increase its average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
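+	/*
+	 * Illustrative example: with 10000 input tuples estimated to fall into
+	 * 100 groups, the startup cost covers sorting only a single inflated
+	 * group of ~150 tuples rather than all 10000 tuples at once.
+	 */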
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..6e2ba08d7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
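+ *
+ *    For example, with keys1 = (a, b, c) and keys2 = (a, b) this sets
+ *    *n_common = 2 and returns false, while with keys1 = (a, b) and
+ *    keys2 = (a, b, c) it sets *n_common = 2 and returns true.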
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use an incremental sort.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..53d08aed2e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..e9918ffcb4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort on
+ * the cheapest-total existing path and incremental sorts on any paths
+ * with presorted keys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..88402a9033 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dbecc00fef..e21768207d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -988,6 +988,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..c2bd38f39f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the allocated array exceeds ALLOCSET_SEPARATE_THRESHOLD (see comments in
+ * grow_memtuples()) while keeping the allocation overhead as low as possible.
+ * However, we don't consider array sizes smaller than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+								 * space, false when it's the value for
+								 * in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persist across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills to disk only when the data doesn't fit in main
+	 * memory, so we treat disk usage as more significant for resource
+	 * tracking than memory usage.  Note that a tuple set may occupy less
+	 * space on disk than in memory due to the more compact on-disk
+	 * representation (e.g., 40MB of tape for a run that needed 60MB in
+	 * memory); in that case it's the disk figure we keep.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Discard all tuple data, but keep the
+ *	meta-information.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
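
For context on how these pieces fit together, here's a minimal sketch of the
intended call pattern (illustrative only, not part of the patch; it assumes
an initialized TupleTableSlot *slot, and fetch_next_tuple_into() and
emit_tuple() are hypothetical stand-ins for the caller's input and output
routines):

	/* metadata is allocated in maincontext and survives resets */
	Tuplesortstate *state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
												 sortOperators, collations,
												 nullsFirst, work_mem,
												 NULL, false);

	for (;;)
	{
		bool		loaded = false;

		while (fetch_next_tuple_into(slot))		/* load one batch */
		{
			tuplesort_puttupleslot(state, slot);
			loaded = true;
		}
		if (!loaded)
			break;

		tuplesort_performsort(state);
		while (tuplesort_gettupleslot(state, true, false, slot, NULL))
			emit_tuple(slot);

		/* drop tuple data but keep metadata, ready for the next batch */
		tuplesort_reset(state);
	}

	tuplesort_end(state);		/* deletes maincontext as well */
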
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..f905e384a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData holds the information for one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfo	fcinfo; /* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
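
To illustrate how a PresortedKeyData entry might be consumed when deciding
whether a tuple still belongs to the current group, here's a sketch (the
function name is illustrative; the executor code in the patch may differ in
detail):

	static bool
	presorted_key_equal(PresortedKeyData *key, TupleTableSlot *a,
						TupleTableSlot *b)
	{
		Datum		datumA,
					datumB;
		bool		isnullA,
					isnullB;

		datumA = slot_getattr(a, key->attno, &isnullA);
		datumB = slot_getattr(b, key->attno, &isnullB);

		/* treat NULL as equal only to another NULL */
		if (isnullA || isnullB)
			return isnullA && isnullB;

		key->fcinfo->args[0].value = datumA;
		key->fcinfo->args[0].isnull = false;
		key->fcinfo->args[1].value = datumB;
		key->fcinfo->args[1].isnull = false;
		key->fcinfo->isnull = false;

		return DatumGetBool(FunctionCallInvoke(key->fcinfo));
	}
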
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2022,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from
+								   the outer node? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups processed by the full sort */
+	int64		prefixsort_group_count;	/* number of groups processed by the prefix-only sort */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..bfee4db721 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..57ecbbb01c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..77c03149cd 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
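
Judging from the names, the presumed semantics of the first function are:
count how many leading pathkeys the two lists share, and report whether
keys1 is entirely a prefix of keys2. A sketch under that assumption (not
necessarily the patch's actual implementation):

	bool
	pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
	{
		int			n = 0;
		ListCell   *key1,
				   *key2;

		/* pathkeys are canonical, so pointer comparison suffices */
		forboth(key1, keys1, key2, keys2)
		{
			if (lfirst(key1) != lfirst(key2))
				break;
			n++;
		}

		*n_common = n;
		return (n == list_length(keys1));
	}
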
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#193James Coleman
jtc331@gmail.com
In reply to: James Coleman (#192)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 7, 2020 at 5:47 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:

That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of memory usage
and the number of groups in each sort mode), and I think it's far more
feasible that I can tackle that piecemeal before the next CF.

James

I'm attaching a rebased patch revision plus a new commit that reworks the
EXPLAIN output. I've left that patch separate for now so that it's easy to
see the difference, and so that I don't unnecessarily cause merge conflicts
while Tomas is working on things in parallel; the next patch revision
(assuming the EXPLAIN change looks good) can incorporate it into the base
patch.

Here's what I've changed:

- The stats necessary for ANALYZE are now only kept if the PlanState has a
non-null instrument field (thanks to Tom for pointing out that this is the
correct way to check that ANALYZE is in flight). I did leave lines like
`node->incsort_info.fullsortGroupInfo.groupCount++;` unguarded by that `if`
since it seems like practically zero overhead (about the same cost as
checking the condition), but if anyone disagrees, I'm happy to change it.
Additionally, those lines (if ANALYZE is not in flight) technically operate
on variables that haven't explicitly been initialized in the Init function;
please tell me if that's actually an issue, given they are counters and we
won't be using them in that case.
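
Concretely, the guard looks roughly like this (a sketch; record_group_stats
is a hypothetical helper, sketched further below, rather than necessarily
the name used in the patch):

	if (node->ss.ps.instrument != NULL)
	{
		/* EXPLAIN ANALYZE is in flight; capture per-group sort stats */
		record_group_stats(&node->incsort_info.fullsortGroupInfo,
						   node->fullsort_state);
	}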

And...I discovered that I need to do this anyway. Basically, the original
patch stored per-worker instrumentation information on every tuple fetch,
which is unnecessary, and in my hasty refactoring I'd replaced that spot
with my code to better record stats. The original patch just looked at the
last tuplesort state, as I'd mentioned previously, so it didn't have any
special instrumentation in non-parallel workers other than incrementing
group counters.

But obviously we don't want to record stats from every tuple; we want to
record sort info every time we finalize a sort. And so I've replaced the
group counter increment lines with calls to a newly broken-out function
that records stats for the appropriate fullsort/prefixsort group info.

I came across this while adding tests for EXPLAIN ANALYZE and saw a result
with the reported average memory usage higher than the max--this happened
because I was adding the memory used on each pass through the loop rather
than once when finalizing the sort.
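
To make the fix concrete, here's roughly what that broken-out helper has to
do: accumulate exactly once per finalized sort, so the running total (and
hence the average) can never exceed the maximum. This is a sketch using the
group-info fields from the EXPLAIN patch; the function name is illustrative:

	static void
	record_group_stats(IncrementalSortGroupInfo *groupInfo,
					   Tuplesortstate *sortState)
	{
		TuplesortInstrumentation stats;

		tuplesort_get_stats(sortState, &stats);

		groupInfo->groupCount++;
		if (stats.spaceType == SORT_SPACE_TYPE_DISK)
		{
			groupInfo->totalDiskSpaceUsed += stats.spaceUsed;
			groupInfo->maxDiskSpaceUsed = Max(groupInfo->maxDiskSpaceUsed,
											  stats.spaceUsed);
		}
		else
		{
			groupInfo->totalMemorySpaceUsed += stats.spaceUsed;
			groupInfo->maxMemorySpaceUsed = Max(groupInfo->maxMemorySpaceUsed,
												stats.spaceUsed);
		}

		/* remember each distinct sort method used across groups */
		groupInfo->sortMethods = list_append_unique_int(groupInfo->sortMethods,
														stats.sortMethod);
	}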

- A good bit of cleanup on how parallel workers are output (I believe
there was some duplicative group opening and also text output inconsistent
with other multi-worker explain nodes). I haven't had a chance to test this
yet, though, so there could be bugs.

Note: I still haven't had time to test parallel plans with the updated
EXPLAIN, so there aren't tests for that either.

- I also left a TODO wondering if we should break out the instrumentation
into a separate function; it seems like a decent-sized chunk of cleanly
extractable code. I suppose that's always a bit of personal preference, so
anyone who wants to weigh in gets a vote :)

I ended up having to do this anyway, for reasons described above.

See new version attached (still with EXPLAIN changes as a separate patch
file).

James

Attachments:

v34-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v34-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 269df4be255b3b117e1ff27ba9b340594aacaf2c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v34 1/3] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: as a result, a higher-cost plan ends
up being chosen, because a low-startup-cost partial path is ignored in
favor of a lower-total-cost partial path even when a limit applied on top
would normally favor the lower-startup-cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 

base-commit: b9c3de62cbc9c6993ceac0de99985cf051e91c88
-- 
2.17.1

v34-0003-Rework-EXPLAIN-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v34-0003-Rework-EXPLAIN-for-incremental-sort.patchDownload
From 83dabf98119a2d8f0b5c420da694fdbd1eeb5078 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 7 Mar 2020 17:09:39 -0500
Subject: [PATCH v34 3/3] Rework EXPLAIN for incremental sort

---
 src/backend/commands/explain.c                | 253 +++++++++---------
 src/backend/executor/nodeIncrementalSort.c    | 121 ++++++---
 src/include/nodes/execnodes.h                 |  29 +-
 .../regress/expected/incremental_sort.out     | 160 +++++++++++
 src/test/regress/sql/incremental_sort.sql     |  10 +
 5 files changed, 402 insertions(+), 171 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 14aedec919..8262c54e6a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2701,80 +2701,114 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
-/*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
- */
+
 static void
-show_incremental_sort_info(IncrementalSortState *incrsortstate,
-						   ExplainState *es)
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+						   const char *groupLabel, ExplainState *es)
 {
-	if (es->analyze && incrsortstate->sort_Done &&
-		incrsortstate->fullsort_state != NULL)
+	const char *sortMethodName;
+	const char *spaceTypeName;
+	ListCell *methodCell;
+	int methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
-		/* TODO: maybe show average, min, max sort group size? */
-
-		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
-		TuplesortInstrumentation fullsort_stats;
-		const char *fullsort_sortMethod;
-		const char *fullsort_spaceType;
-		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
-		TuplesortInstrumentation prefixsort_stats;
-		const char *prefixsort_sortMethod;
-		const char *prefixsort_spaceType;
-
-		tuplesort_get_stats(fullsort_state, &fullsort_stats);
-		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
-		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
-		if (prefixsort_state != NULL)
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
 		{
-			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
-			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
-			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
 		}
+		appendStringInfo(es->str, ")");
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
-							 fullsort_sortMethod, fullsort_spaceType,
-							 fullsort_stats.spaceUsed);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
-								 prefixsort_sortMethod, prefixsort_spaceType,
-								 prefixsort_stats.spaceUsed);
-			else
-				appendStringInfo(es->str, "\n");
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
-							 incrsortstate->fullsort_group_count);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %ld\n",
-							 incrsortstate->prefixsort_group_count);
-			else
-				appendStringInfo(es->str, "\n");
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
 		}
-		else
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
-			/* TODO */
-			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-			ExplainPropertyInteger("Full Sort Space Used", "kB",
-					fullsort_stats.spaceUsed, es);
-			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-			ExplainPropertyInteger("Full Sort Groups", NULL,
-								   incrsortstate->fullsort_group_count, es);
-
-			if (prefixsort_state != NULL)
-			{
-				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
-						prefixsort_stats.spaceUsed, es);
-				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-				ExplainPropertyInteger("Prefix Sort Groups", NULL,
-									   incrsortstate->prefixsort_group_count, es);
-			}
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			methodNames = lappend(methodNames, sortMethodName);
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
 		}
+
+		ExplainCloseGroup("Incremental Sort Groups", "XXX Groups", true, es);
 	}
+}
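+
+/*
+ * Illustrative note (mirroring the regression output added below): in text
+ * format the function above emits a line of the form
+ *
+ *   Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+ */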
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
@@ -2785,79 +2819,36 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		{
 			IncrementalSortInfo *incsort_info =
 				&incrsortstate->shared_info->sinfo[n];
-			TuplesortInstrumentation *fullsort_instrument;
-			const char *fullsort_sortMethod;
-			const char *fullsort_spaceType;
-			long		fullsort_spaceUsed;
-			int64		fullsort_group_count;
-			TuplesortInstrumentation *prefixsort_instrument;
-			const char *prefixsort_sortMethod;
-			const char *prefixsort_spaceType;
-			long		prefixsort_spaceUsed;
-			int64		prefixsort_group_count;
-
-			fullsort_instrument = &incsort_info->fullsort_instrument;
-			fullsort_group_count = incsort_info->fullsort_group_count;
-
-			prefixsort_instrument = &incsort_info->prefixsort_instrument;
-			prefixsort_group_count = incsort_info->prefixsort_group_count;
-
-			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
-				continue;		/* ignore any unfilled slots */
-
-			fullsort_sortMethod = tuplesort_method_name(
-					fullsort_instrument->sortMethod);
-			fullsort_spaceType = tuplesort_space_type_name(
-					fullsort_instrument->spaceType);
-			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots").
+			 * I'm not convinced that makes sense, since the same sort
+			 * instrument can have been used multiple times; whether its most
+			 * recent use was still in progress doesn't seem relevant.
+			 * Instead I now check whether the group count for each group
+			 * info is 0. If both are 0, then we exclude the worker since it
+			 * didn't contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+					prefixsortGroupInfo->groupCount == 0)
+				continue;
 
-			if (prefixsort_instrument)
+			if (!opened_group)
 			{
-				prefixsort_sortMethod = tuplesort_method_name(
-						prefixsort_instrument->sortMethod);
-				prefixsort_spaceType = tuplesort_space_type_name(
-						prefixsort_instrument->spaceType);
-				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
 			}
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfo(es->str,
-								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
-								 n, fullsort_sortMethod, fullsort_spaceType,
-								 fullsort_spaceUsed, fullsort_group_count);
-				if (prefixsort_instrument)
-					appendStringInfo(es->str,
-									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
-									 prefixsort_sortMethod, prefixsort_spaceType,
-									 prefixsort_spaceUsed, prefixsort_group_count);
-				else
-					appendStringInfo(es->str, "\n");
-			}
-			else
-			{
-				if (!opened_group)
-				{
-					ExplainOpenGroup("Workers", "Workers", false, es);
-					opened_group = true;
-				}
-				ExplainOpenGroup("Worker", NULL, true, es);
-				ExplainPropertyInteger("Worker Number", NULL, n, es);
-				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
-				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
-				if (prefixsort_instrument)
-				{
-					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
-					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
-				}
-				ExplainCloseGroup("Worker", NULL, true, es);
-			}
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 		}
+
 		if (opened_group)
 			ExplainCloseGroup("Workers", "Workers", false, es);
 	}
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index c3b903e568..e6f749a798 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -68,6 +68,47 @@
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
 
+/*
+ * Accumulate the just-sorted group's tuplesort stats into the given group
+ * info, and copy the node's stats into shared memory when running in a
+ * parallel worker.
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation	sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+				sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+				&node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
 /*
  * Prepare information for presorted_keys comparison.
  */
@@ -199,8 +240,9 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
  * one different prefix key group before the large prefix key group.
  */
 static void
-switchToPresortedPrefixMode(IncrementalSortState *node)
+switchToPresortedPrefixMode(PlanState *pstate)
 {
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
 	ScanDirection		dir;
 	int64 nTuples = 0;
 	bool lastTuple = false;
@@ -355,7 +397,11 @@ switchToPresortedPrefixMode(IncrementalSortState *node)
 		 */
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+					&node->incsort_info.prefixsortGroupInfo,
+					node->prefixsort_state);
 
 		if (node->bounded)
 		{
@@ -479,7 +525,7 @@ ExecIncrementalSort(PlanState *pstate)
 			 */
 			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
 					node->n_fullsort_remaining);
-			switchToPresortedPrefixMode(node);
+			switchToPresortedPrefixMode(pstate);
 		}
 		else
 		{
@@ -602,7 +648,11 @@ ExecIncrementalSort(PlanState *pstate)
 
 				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+							&node->incsort_info.fullsortGroupInfo,
+							fullsort_state);
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -673,7 +723,12 @@ ExecIncrementalSort(PlanState *pstate)
 					 */
 					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 					tuplesort_performsort(fullsort_state);
-					node->fullsort_group_count++;
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+								&node->incsort_info.fullsortGroupInfo,
+								fullsort_state);
+
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
 					break;
@@ -705,7 +760,10 @@ ExecIncrementalSort(PlanState *pstate)
 				 */
 				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+							&node->incsort_info.fullsortGroupInfo,
+							fullsort_state);
 
 				/*
 				 * If the full sort tuplesort happened to switch into top-n heapsort mode
@@ -735,7 +793,7 @@ ExecIncrementalSort(PlanState *pstate)
 				node->n_fullsort_remaining = nTuples;
 
 				/* Transition the tuples to the presorted prefix tuplesort. */
-				switchToPresortedPrefixMode(node);
+				switchToPresortedPrefixMode(pstate);
 
 				/*
 				 * Since we know we had tuples to move to the presorted prefix
@@ -801,7 +859,12 @@ ExecIncrementalSort(PlanState *pstate)
 		/* Perform the sort and return the tuples to the inner plan nodes. */
 		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+					&node->incsort_info.prefixsortGroupInfo,
+					node->prefixsort_state);
+
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
 
@@ -828,26 +891,6 @@ ExecIncrementalSort(PlanState *pstate)
 	 */
 	node->sort_Done = true;
 
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
-	{
-		IncrementalSortInfo *incsort_info =
-			&node->shared_info->sinfo[ParallelWorkerNumber];
-
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
-
-		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
-		incsort_info->fullsort_group_count = node->fullsort_group_count;
-
-		if (node->prefixsort_state)
-		{
-			tuplesort_get_stats(node->prefixsort_state,
-					&incsort_info->prefixsort_instrument);
-			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
-		}
-	}
-
 	/*
 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
 	 * tuples.
@@ -900,10 +943,28 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	incrsortstate->transfer_tuple = NULL;
 	incrsortstate->n_fullsort_remaining = 0;
 	incrsortstate->bound_Done = 0;
-	incrsortstate->fullsort_group_count = 0;
-	incrsortstate->prefixsort_group_count = 0;
 	incrsortstate->presorted_keys = NULL;
 
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+			&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+			&incrsortstate->incsort_info.prefixsortGroupInfo;
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
 	/*
 	 * Miscellaneous initialization
 	 *
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f905e384a2..0934482123 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2022,18 +2022,26 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
-/* ----------------
- *	 Shared memory container for per-worker incremental sort information
- * ----------------
- */
+typedef struct IncrementalSortGroupInfo
+{
+	int64 groupCount;			/* number of groups sorted so far */
+	long maxDiskSpaceUsed;		/* kB; maximum for any one group */
+	long totalDiskSpaceUsed;	/* kB; total across all groups */
+	long maxMemorySpaceUsed;	/* kB; maximum for any one group */
+	long totalMemorySpaceUsed;	/* kB; total across all groups */
+	List *sortMethods;			/* TuplesortMethod values used (int list) */
+} IncrementalSortGroupInfo;
+
 typedef struct IncrementalSortInfo
 {
-	TuplesortInstrumentation	fullsort_instrument;
-	int64						fullsort_group_count;
-	TuplesortInstrumentation	prefixsort_instrument;
-	int64						prefixsort_group_count;
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
 } IncrementalSortInfo;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
 typedef struct SharedIncrementalSortInfo
 {
 	int							num_workers;
@@ -2067,8 +2075,9 @@ typedef struct IncrementalSortState
 	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
 	/* the keys by which the input path is already sorted */
 	PresortedKeyData *presorted_keys;
-	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
-	int64		prefixsort_group_count;	/* number of groups with equal presorted keys */
+
+	IncrementalSortInfo incsort_info;	/* EXPLAIN ANALYZE instrumentation */
+
 	/* slot for pivot tuple defining values of presorted keys within group */
 	TupleTableSlot *group_pivot;
 	TupleTableSlot *transfer_tuple;
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 3a58efdf91..7892b111d7 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -436,6 +436,82 @@ select * from (select * from t order by a) s order by a, b limit 55;
  2 | 55
 (55 rows)
 
+-- Test EXPLAIN ANALYZE (text and json output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
 delete from t;
 -- An initial small group followed by a large group.
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
@@ -526,6 +602,90 @@ select * from (select * from t order by a) s order by a, b limit 70;
  9 | 70
 (70 rows)
 
+-- Test EXPLAIN ANALYZE (text and json output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 26kB (avg), 30kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 26,       +
+             "Maximum Sort Space Used": 30,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
 delete from t;
 -- Small groups of 10 tuples each tested around each mode transition point.
 insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index b9df37412f..9320a10b91 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -39,12 +39,22 @@ delete from t;
 insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
 select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text and json output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
 delete from t;
 
 -- An initial small group followed by a large group.
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
 select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text and json output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
 delete from t;
 
 -- Small groups of 10 tuples each tested around each mode transition point.
-- 
2.17.1

v34-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v34-0002-Implement-incremental-sort.patchDownload
From 3773b33250d78c39b989898376868af66f0c17ea Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v34 2/3] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
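
As an illustrative sketch (the table and index here are hypothetical,
not part of this patch), a query that can benefit from incremental sort
looks like:

    CREATE TABLE events (ts timestamptz, payload int);
    CREATE INDEX ON events (ts);
    -- The index already provides ordering by ts, so only payload needs
    -- sorting within each group of equal ts values:
    EXPLAIN SELECT * FROM events ORDER BY ts, payload LIMIT 10;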

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   13 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1107 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  194 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  129 +-
 src/backend/optimizer/plan/planner.c          |   71 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   30 +
 src/include/nodes/execnodes.h                 |   68 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1160 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   78 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3505 insertions(+), 115 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..9436d96bc0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
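
As a usage sketch for the GUC added above (table t and the query shape
match the regression tests later in this patch):

    SET enable_incrementalsort = off;
    EXPLAIN (COSTS OFF)
      SELECT * FROM (SELECT * FROM t ORDER BY a) s ORDER BY a, b LIMIT 55;
    RESET enable_incrementalsort;
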
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..14aedec919 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..cba648a95e 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..8051f46a71 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is a Sort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeSort.c to react properly to
+		 * changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0, ... n), the tail keys are more
+	 * likely to change.  Therefore we do our comparison starting from the
+	 * last pre-sorted column to optimize for early detection of inequality
+	 * and to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the already-fetched tuples are all part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
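+		/*
+		 * lastTuple is true when the tuple handled in this iteration is the
+		 * final one remaining in the full sort batch; if we get that far
+		 * without finding a prefix key change, the whole batch belonged to
+		 * a single prefix key group.
+		 */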
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch, so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special
+				 * to save the tuple, since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though, since we've finished
+				 * the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+		{
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+		}
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples
+		 * into the presorted prefix tuplesort. Now we can save our pivot
+		 * comparison tuple and continue fetching tuples from the outer
+		 * execution node to load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let our caller
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start
+ * looking for a new group as soon as we've met our bound, to avoid fetching
+ * more tuples than we absolutely have to.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
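+
+/*
+ * For example, with the defaults above an unbounded sort loads the first
+ * 32 tuples of each batch without any prefix key checks; if it then sees
+ * roughly another 32 tuples without a prefix key change, it assumes a
+ * large group and switches to presorted prefix mode.
+ */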
+
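+/*
+ * Sketch of the execution_status transitions implemented below:
+ *
+ * - INCSORT_LOADFULLSORT: on a prefix key change or end of input, sort the
+ *   batch and move to INCSORT_READFULLSORT; on exceeding
+ *   DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples, call
+ *   switchToPresortedPrefixMode(), which sets INCSORT_LOADPREFIXSORT or
+ *   INCSORT_READPREFIXSORT.
+ * - INCSORT_LOADPREFIXSORT: on a prefix key change or end of input, sort the
+ *   group and move to INCSORT_READPREFIXSORT.
+ * - INCSORT_READFULLSORT/INCSORT_READPREFIXSORT: once the tuplesort is
+ *   drained, either re-enter switchToPresortedPrefixMode() (if tuples remain
+ *   in the full sort state) or start over in INCSORT_LOADFULLSORT.
+ */
+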
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
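+ *
+ *		For example, if the input is already sorted by (a) and the requested
+ *		ordering is (a, b), then each group of tuples with equal values of
+ *		"a" need only be sorted by "b".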
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out,
+			 * and we've returned those tuples to our caller, but if there
+			 * are tuples remaining in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining == 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the number of tuples remaining if bounded, and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
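+
+		/*
+		 * For example, with LIMIT 5 the remaining bound caps minGroupSize at
+		 * 5, so we begin checking for a prefix key change after only 5
+		 * tuples rather than DEFAULT_MIN_GROUP_SIZE.
+		 */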
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have
+		 * to carry over that extra tuple and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current prefix
+				 * key group, since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key
+				 * group, because the tuples we've "lost" sort after the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (it wasn't carried
+		 * over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to our caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we hold only the current prefix key group's
+	 * tuples in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slots to store the group pivot and carried-over tuples */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots holding tuples from the outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  We must re-sort in any case,
+	 * since we never request random access from tuplesort and retain only
+	 * the current prefix key group's tuples.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..d2b9bd95ba 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..d1748d1011 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines the cost of sorting a relation, including the cost of reading
+ *	the input data, and returns it via *startup_cost and *run_cost.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+						linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, and here we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and increase its average group size by half.
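+	 *
+	 * For example, if 1M input tuples are estimated to fall into 1,000
+	 * groups, group_tuples is 1,000 and each per-group sort below is
+	 * costed as though it handled 1,500 tuples.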
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..6e2ba08d7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
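+ *
+ *    For example, given keys1 = (a, b, c) and keys2 = (a, b, d), *n_common
+ *    is set to 2 and false is returned, since keys1 is not a prefix of
+ *    keys2.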
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use incremental sort.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..53d08aed2e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..e9918ffcb4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..88402a9033 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dbecc00fef..e21768207d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -988,6 +988,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..c2bd38f39f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is as small as possible.  However, we don't consider array
+ * sizes of less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								   sorts of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is the value for on-disk
+								   space, false when it's the value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't manage to fit the data
+	 * in main memory.  This is why we consider space used on disk more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount occupied by the same tuples in memory,
+	 * due to a more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave
+ *	the meta-information in.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..f905e384a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset might already be
+ *	 presorted by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo				flinfo;	/* comparison function info */
+	FunctionCallInfo	fcinfo; /* comparison function call info */
+	OffsetNumber			attno;	/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2022,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* is fetching tuples from the outer
+								   node finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
+	int64		prefixsort_group_count;	/* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..bfee4db721 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..57ecbbb01c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..77c03149cd 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect.
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect.
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#194Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#193)
5 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

I've been focusing on the planner part of the patch, particularly on
picking the places that need to consider incremental sort paths. As
discussed in the past, we might simply consider incremental sort
everywhere we need a sorted path, but that's likely overkill. So I used
the set of queries used for previous tests [1] and determined which
places affect the most plans (using a separate GUC for each place, per [2]).
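
For illustration, the effect of one such place can be observed by
comparing plans with the consideration toggled. A hypothetical example
(table t and an index on t(a) are made up, and this uses the single
consolidated GUC mentioned below rather than the per-place ones):

EXPLAIN (COSTS OFF)
SELECT * FROM t ORDER BY a, b LIMIT 10;   -- may pick Incremental Sort

SET enable_incrementalsort = off;
EXPLAIN (COSTS OFF)
SELECT * FROM t ORDER BY a, b LIMIT 10;   -- falls back to a full Sort

RESET enable_incrementalsort;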

The result is attached, on top of the incremental sort patch. Part 0004
tweaks all the places needed to cover all plan changes in the test
queries, and part 0005 tweaks a couple of additional places where I think
we should consider incremental sort (but I've been unable to construct a
query for which it'd make a difference).

I've also removed the various GUCs, and all the places now use the same
enable_incrementalsort GUC. Compared to [2], I've also modified the code
to consider full and incremental sort in the same loop, which allows
reusing some of the work.

The next thing I'll work on is determining how expensive this can be in
extreme cases with many indexes etc.

Now, a couple comments about parts 0001 - 0003 of the patch ...

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-               "Sort Space Used": 30,                             +
+               "Sort Space Used": 29,                             +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.

2) I think this bit in ExecReScanIncrementalSort is wrong:

node->sort_Done = false;
tuplesort_end(node->fullsort_state);
node->prefixsort_state = NULL;
tuplesort_end(node->fullsort_state);
node->prefixsort_state = NULL;
node->bound_Done = 0;

Notice both places reset fullsort_state and set prefixsort_state to
NULL. Another thing is that I'm not sure it's fine to pass NULL to
tuplesort_end (my guess is tuplesort_free will fail when it gets NULL).
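
Presumably something like this was intended (just a sketch of a possible
fix, not code from the patch):

/* hypothetical fix: end each sort state and reset its own pointer */
node->sort_Done = false;
if (node->fullsort_state != NULL)
{
    tuplesort_end(node->fullsort_state);
    node->fullsort_state = NULL;
}
if (node->prefixsort_state != NULL)
{
    tuplesort_end(node->prefixsort_state);
    node->prefixsort_state = NULL;
}
node->bound_Done = 0;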

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

                        QUERY PLAN
---------------------------------------------------------
 Limit
   ->  GroupAggregate
         Group Key: t.a, t.b, t.c, t.d
         ->  Incremental Sort
               Sort Key: t.a, t.b, t.c, t.d
               Presorted Key: t.a, t.b, t.c
               ->  Incremental Sort
                     Sort Key: t.a, t.b, t.c
                     Presorted Key: t.a, t.b
                     ->  Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

                  QUERY PLAN
---------------------------------------------
 Limit
   ->  GroupAggregate
         Group Key: t.a, t.b, t.c, t.d
         ->  Sort
               Sort Key: t.a, t.b, t.c, t.d
               ->  Sort
                     Sort Key: t.a, t.b, t.c
                     ->  Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me, though, is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats
doesn't show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

[1]: https://github.com/tvondra/incremental-sort-tests-2

[2]: https://github.com/tvondra/postgres/tree/incremental-sort-20200309

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Consider-low-startup-cost-when-adding-parti-20200310.patchtext/plain; charset=us-asciiDownload
From 61773e100692a0e2823607a8ce6ba1e6d1e2d209 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH 1/5] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost
plan ends up being chosen: a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.21.1

0002-Implement-incremental-sort-20200310.patchtext/plain; charset=us-asciiDownload
From e9095b170cd4baaaae8272ee88641e416fa0f926 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH 2/5] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   13 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1107 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  194 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  129 +-
 src/backend/optimizer/plan/planner.c          |   71 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   30 +
 src/include/nodes/execnodes.h                 |   68 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1160 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   78 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3505 insertions(+), 115 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb..64ea00f462 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..14aedec919 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+					   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	if (es->analyze && incrsortstate->sort_Done &&
+		incrsortstate->fullsort_state != NULL)
+	{
+		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+		/* TODO: maybe show average, min, max sort group size? */
+
+		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+		TuplesortInstrumentation fullsort_stats;
+		const char *fullsort_sortMethod;
+		const char *fullsort_spaceType;
+		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+		TuplesortInstrumentation prefixsort_stats;
+		const char *prefixsort_sortMethod;
+		const char *prefixsort_spaceType;
+
+		tuplesort_get_stats(fullsort_state, &fullsort_stats);
+		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+		if (prefixsort_state != NULL)
+		{
+			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+		}
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+							 fullsort_sortMethod, fullsort_spaceType,
+							 fullsort_stats.spaceUsed);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+								 prefixsort_sortMethod, prefixsort_spaceType,
+								 prefixsort_stats.spaceUsed);
+			else
+				appendStringInfo(es->str, "\n");
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+							 incrsortstate->fullsort_group_count);
+			if (prefixsort_state != NULL)
+				appendStringInfo(es->str, ", Prefix-only: %ld\n",
+							 incrsortstate->prefixsort_group_count);
+			else
+				appendStringInfo(es->str, "\n");
+		}
+		else
+		{
+			/* TODO */
+			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+			ExplainPropertyInteger("Full Sort Space Used", "kB",
+					fullsort_stats.spaceUsed, es);
+			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+			ExplainPropertyInteger("Full Sort Groups", NULL,
+								   incrsortstate->fullsort_group_count, es);
+
+			if (prefixsort_state != NULL)
+			{
+				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+						prefixsort_stats.spaceUsed, es);
+				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+				ExplainPropertyInteger("Prefix Sort Groups", NULL,
+									   incrsortstate->prefixsort_group_count, es);
+			}
+		}
+	}
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+				&incrsortstate->shared_info->sinfo[n];
+			TuplesortInstrumentation *fullsort_instrument;
+			const char *fullsort_sortMethod;
+			const char *fullsort_spaceType;
+			long		fullsort_spaceUsed;
+			int64		fullsort_group_count;
+			TuplesortInstrumentation *prefixsort_instrument;
+			const char *prefixsort_sortMethod;
+			const char *prefixsort_spaceType;
+			long		prefixsort_spaceUsed;
+			int64		prefixsort_group_count;
+
+			fullsort_instrument = &incsort_info->fullsort_instrument;
+			fullsort_group_count = incsort_info->fullsort_group_count;
+
+			prefixsort_instrument = &incsort_info->prefixsort_instrument;
+			prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+				continue;		/* ignore any unfilled slots */
+
+			fullsort_sortMethod = tuplesort_method_name(
+					fullsort_instrument->sortMethod);
+			fullsort_spaceType = tuplesort_space_type_name(
+					fullsort_instrument->spaceType);
+			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+			if (prefixsort_instrument)
+			{
+				prefixsort_sortMethod = tuplesort_method_name(
+						prefixsort_instrument->sortMethod);
+				prefixsort_spaceType = tuplesort_space_type_name(
+						prefixsort_instrument->spaceType);
+				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+			}
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfo(es->str,
+								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+								 n, fullsort_sortMethod, fullsort_spaceType,
+								 fullsort_spaceUsed, fullsort_group_count);
+				if (prefixsort_instrument)
+					appendStringInfo(es->str,
+									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+									 prefixsort_sortMethod, prefixsort_spaceType,
+									 prefixsort_spaceUsed, prefixsort_group_count);
+				else
+					appendStringInfo(es->str, "\n");
+			}
+			else
+			{
+				if (!opened_group)
+				{
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
+				}
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", NULL, n, es);
+				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+				if (prefixsort_instrument)
+				{
+					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+				}
+				ExplainCloseGroup("Worker", NULL, true, es);
+			}
+		}
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..cba648a95e 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,16 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group
+			 * of tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..8051f46a71 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react properly
+		 * to changes of these parameters.  If we ever redesign this, it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm splits the input into the following
+ *		groups, which have equal X, and then sorts them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them back together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int					presortedCols,
+						i;
+
+	Assert(IsA(plannode, IncrementalSort));
+	presortedCols = plannode->presortedCols;
+
+	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+													sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (i = 0; i < presortedCols; i++)
+	{
+		Oid					equalityOp,
+							equalityFunc;
+		PresortedKeyData   *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(
+										plannode->sort.sortOperators[i], NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+					plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int presortedCols, i;
+
+	Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail keys
+	 * are more likely to change. Therefore we do our comparison starting from
+	 * the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum				datumA,
+							datumB,
+							result;
+		bool				isnullA,
+							isnullB;
+		AttrNumber			attno = node->presorted_keys[i].attno;
+		PresortedKeyData   *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the already-fetched tuples are all part of a single
+ * prefix key group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+	ScanDirection		dir;
+	int64 nTuples = 0;
+	bool lastTuple = false;
+	bool firstTuple = true;
+	TupleDesc		    tupDesc;
+	PlanState		   *outerNode;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal
+		 * and thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(
+				tupDesc,
+				plannode->sort.numCols - presortedCols,
+				&(plannode->sort.sortColIdx[presortedCols]),
+				&(plannode->sort.sortOperators[presortedCols]),
+				&(plannode->sort.collations[presortedCols]),
+				&(plannode->sort.nullsFirst[presortedCols]),
+				work_mem,
+				NULL,
+				false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure
+	 * the tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+				node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+					ScanDirectionIsForward(dir),
+					false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to save the
+			 * first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/* The tuple isn't part of the current batch so we need to carry
+				 * it over into the next set of tuples we transfer out of the full
+				 * sort tuplesort into the presorted prefix tuplesort. We don't
+				 * actually have to do anything special to save the tuple since
+				 * we've already loaded it into the node->transfer_tuple slot, and,
+				 * even though that slot points to memory inside the full sort
+				 * tuplesort, we can't reset that tuplesort anyway until we've
+				 * fully transferred out of its tuples, so this reference is safe.
+				 * We do need to reset the group pivot tuple though since we've
+				 * finished the current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into
+		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/* Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume
+		 * we have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner node
+		 * read out all of those tuples, and then come back around to find
+		 * another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState			   *estate;
+	ScanDirection		dir;
+	Tuplesortstate	   *read_sortstate;
+	Tuplesortstate	   *fullsort_state;
+	TupleTableSlot	   *slot;
+	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState		   *outerNode;
+	TupleDesc			tupDesc;
+	int64				nTuples = 0;
+	int64				minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+			|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->finished)
+			/*
+			 * TODO: there isn't a good test case for the node->finished
+			 * case directly, but lots of other stuff fails if it's not
+			 * there. If the outer node will fail when trying to fetch
+			 * too many tuples, then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the full
+			 * sort tuplesort. The first call to switchToPresortedPrefixMode()
+			 * pulled one of those groups out, and we've returned those
+			 * tuples to the caller, but if there are tuples remaining in that
+			 * tuplesort (i.e., n_fullsort_remaining > 0) at this point we
+			 * need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(node);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're not
+			 * in the middle of transitioning into presorted prefix sort mode,
+			 * then it's time to start the process all over again by building
+			 * a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(
+					tupDesc,
+					plannode->sort.numCols,
+					plannode->sort.sortColIdx,
+					plannode->sort.sortOperators,
+					plannode->sort.collations,
+					plannode->sort.nullsFirst,
+					work_mem,
+					NULL,
+					false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the number of tuples remaining if we're bounded, and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64 currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n heap
+			 * sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/* Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't assume
+			 * the group pivot tuple will remain the same -- unless we're using
+			 * a minimum group size of 1, in which case the pivot is obviously
+			 * still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then don't
+				 * bother checking for inclusion in the current prefix key group,
+				 * since a large number of very tiny sorts is inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this
+					 * sort group. Instead we need to carry it over to the
+					 * next group. We use the group_pivot slot as a temp
+					 * container for that purpose even though we won't actually
+					 * treat it as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound remaining
+						 * is (original bound - n), so store the current number
+						 * of processed tuples for use in configuring sorting
+						 * bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					tuplesort_performsort(fullsort_state);
+					node->fullsort_group_count++;
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found
+			 * a large group of tuples having a single prefix key (as long
+			 * as the last tuple didn't shift us into reading from the full
+			 * sort mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+					node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into the
+				 * tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				node->fullsort_group_count++;
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n heapsort mode
+				 * then we will only be able to retrieve currentBound tuples (since the
+				 * tuplesort will have only retained the top-n tuples). This is safe even
+				 * though we haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" were already sorted "below" the retained
+				 * ones, and we're already contractually guaranteed to not need any more than
+				 * the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64 currentBound = node->bound - node->bound_Done;
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the fullsort
+				 * to presorted prefix sort (we might have multiple prefix key
+				 * groups, so we need a way to see if we've actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(node);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop out
+				 * of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've already
+		 * established a current group pivot tuple (but it wasn't carried over;
+		 * it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix tuplesort
+				 * until we find changed prefix keys. Only then can we guarantee
+				 * sort stability of the tuples we've already accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to the caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+		node->prefixsort_group_count++;
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is
+			 * (original bound - n), so store the current number of processed
+			 * tuples for use in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		IncrementalSortInfo *incsort_info =
+			&node->shared_info->sinfo[ParallelWorkerNumber];
+
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+		incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+		if (node->prefixsort_state)
+		{
+			tuplesort_get_stats(node->prefixsort_state,
+					&incsort_info->prefixsort_instrument);
+			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+		}
+	}
+
+	/*
+	 * Get the first or next tuple from the appropriate tuplesort; the result
+	 * slot comes back empty when there are no more tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState   *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the current
+	 * sort group in the tuplesort state rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_group_count = 0;
+	incrsortstate->prefixsort_group_count = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info because
+	 * this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support rescanning the sorted output: since
+	 * we hold only the current sort group in the tuplesorts, we must always
+	 * forget previous sort results, re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..d2b9bd95ba 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..d1748d1011 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data; the results are
+ *	  returned via *startup_cost and *run_cost.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+		  double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+			linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting a single group whose presorted
+	 * keys are all equal.  Incremental sort is sensitive to the distribution
+	 * of tuples across groups, where we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and
+	 * inflate its average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing
+	 * this group, plus the total cost to process the remaining groups,
+	 * plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
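+
+	/*
+	 * To summarize the model above (restating the code, not adding to it):
+	 *
+	 *   startup = input_startup + group_startup + group_input_run
+	 *   run     = group_run
+	 *           + (input_groups - 1) * (group_startup + group_run
+	 *                                   + group_input_run)
+	 *           + (cpu_tuple_cost + comparison_cost) * input_tuples
+	 *           + 2 * cpu_tuple_cost * input_groups
+	 *
+	 * Only the first group's sort must complete before the first tuple can
+	 * be returned, which is what makes this attractive under LIMIT.
+	 */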
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost startup_cost;
+	Cost run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..6e2ba08d7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
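+ *
+ *    For example: with keys1 = (a, b) and keys2 = (a, c), *n_common is set
+ *    to 1 and false is returned; with keys2 = (a, b, c), *n_common is 2 and
+ *    true is returned.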
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int		n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int	n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.
+	 * Any leading common pathkeys could be useful for ordering because
+	 * we can use incremental sort.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..53d08aed2e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+									IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+		  int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+						List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort	   *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+	Cost		startup_cost,
+				run_cost;
 
-	cost_sort(&sort_path, root, NIL,
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans
+	 * because they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
 			  0.0,
 			  work_mem,
 			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
+
+	node = makeNode(Sort);
 
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+		  AttrNumber *sortColIdx, Oid *sortOperators,
+		  Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort	   *node;
+	Plan			   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an incremental sort plan sorted according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+						Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..e9918ffcb4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort on
+ * the cheapest-total existing path and incremental sorts on any paths
+ * with presorted keys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
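+		/*
+		 * Note: is_sorted is true when the path already satisfies all of
+		 * root->sort_pathkeys; otherwise presorted_keys tells us how many
+		 * leading pathkeys already match, which is what determines whether
+		 * an incremental sort is applicable.
+		 */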
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can take
+				 * advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..88402a9033 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+			  root, pathkeys, presorted_keys,
+			  subpath->startup_cost,
+			  subpath->total_cost,
+			  subpath->rows,
+			  subpath->pathtarget->width,
+			  0.0,				/* XXX comparison_cost shouldn't be 0? */
+			  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..4949ef2079 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..c2bd38f39f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the overhead of
+ * allocation is as small as possible.  However, we don't consider array
+ * sizes less than 1024.
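+ *
+ * (With the usual 8kB ALLOCSET_SEPARATE_THRESHOLD and a SortTuple of a few
+ * dozen bytes, the second Max() argument comes to a few hundred entries, so
+ * the result is normally 1024; exact numbers are platform-dependent.)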
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								   one group's sort, either in-memory or
+								   on-disk */
+	bool		maxSpaceOnDisk;	/* true when maxSpace is a value for on-disk
+								   space, false when it's a value for
+								   in-memory space */
+	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+								   that persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing a tuplesort's resources.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Reset the per-sort memory context, thereby releasing all working
+	 * memory.  (The Tuplesortstate struct itself now lives in maincontext
+	 * and survives until tuplesort_end.)
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64	spaceUsed;
+	bool	spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we treat space used on disk as more important
+	 * for tracking resource usage than space used in memory.  Note that
+	 * the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount of space occupied by the same tuples in memory,
+	 * due to a more compact representation.
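+	 *
+	 * (So an on-disk measurement always supersedes an earlier in-memory
+	 * one, even when the on-disk figure is numerically smaller.)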
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in place.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating the tuplesort (and saves
+ *	resources) when sorting multiple small batches.
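+ *
+ *	A typical usage sketch: call one of the tuplesort_begin_*() routines
+ *	once, then for each batch feed tuples in, call tuplesort_performsort(),
+ *	read the results back, and call tuplesort_reset() before loading the
+ *	next batch; tuplesort_end() releases everything at the end.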
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
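+	/*
+	 * If the memtuples array shrank below its initial size (mergeruns()
+	 * replaces it with a much smaller array holding one slot per input
+	 * tape), restore it before starting the next batch.
+	 */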
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+										numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..f905e384a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be
+ *	 presorted by some prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo			flinfo;		/* comparison function info */
+	FunctionCallInfo	fcinfo;		/* comparison function call info */
+	OffsetNumber		attno;		/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2022,60 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+	TuplesortInstrumentation	fullsort_instrument;
+	int64						fullsort_group_count;
+	TuplesortInstrumentation	prefixsort_instrument;
+	int64						prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+	int							num_workers;
+	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
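+/*
+ * Rough sketch of the execution states: the node starts in LOADFULLSORT,
+ * feeding tuples into fullsort_state; when appropriate it transitions via
+ * switchToPresortedPrefixMode() to LOADPREFIXSORT, where only tuples of the
+ * current prefix key group are accumulated in prefixsort_state.  The two
+ * READ* states return sorted tuples from the corresponding tuplesort.
+ */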
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* has fetching tuples from the outer
+								   node finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64			n_fullsort_remaining;
+	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups sorted by the full sort */
+	int64		prefixsort_group_count;	/* number of groups sorted by the prefix sort */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..bfee4db721 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+		  PlannerInfo *root, List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+		  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..57ecbbb01c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+				 RelOptInfo *rel,
+				 Path *subpath,
+				 List *pathkeys,
+				 int presorted_keys,
+				 double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..77c03149cd 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
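
pathkeys_common() is what lets the planner decide how many leading sort keys
the input already provides.  Its contract reduces to a common-prefix length;
here is a standalone illustration with pathkeys stood in for by plain ints
(common_prefix and the arrays are illustration only):

#include <stdio.h>

static int
common_prefix(const int *k1, int n1, const int *k2, int n2)
{
	int		n = 0;

	while (n < n1 && n < n2 && k1[n] == k2[n])
		n++;
	return n;
}

int
main(void)
{
	int		required[] = {1, 2};	/* ORDER BY a, b */
	int		provided[] = {1};		/* input sorted by a */

	/* prints 1: one presorted key, so incremental sort is applicable */
	printf("presorted keys: %d\n", common_prefix(required, 2, provided, 1));
	return 0;
}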
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
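
tuplesort_reset() is what makes per-group reuse cheap: instead of calling
tuplesort_end() and a fresh tuplesort_begin_*() for every prefix-key group,
the same sort state is emptied and reused.  The intended calling pattern,
roughly (a sketch against PostgreSQL internals; the caller is assumed to
have already loaded the current group with tuplesort_puttupleslot()):

static void
finish_one_group(Tuplesortstate *state, TupleTableSlot *slot)
{
	tuplesort_performsort(state);

	while (tuplesort_gettupleslot(state, true, false, slot, NULL))
		;						/* hand each tuple up to the parent node */

	/* keep memory and settings around for the next group */
	tuplesort_reset(state);
}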
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows, or by adding a GUC that would
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can change the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows, or by adding a GUC that would
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can change the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.1

0003-Rework-EXPLAIN-for-incremental-sort-20200310.patch (text/plain; charset=us-ascii)
From fe65713ecad2fb46cf66adb3e2e58307ee0432d8 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 7 Mar 2020 17:09:39 -0500
Subject: [PATCH 3/5] Rework EXPLAIN for incremental sort

---
 src/backend/commands/explain.c                | 253 +++++++++---------
 src/backend/executor/nodeIncrementalSort.c    | 121 ++++++---
 src/include/nodes/execnodes.h                 |  29 +-
 .../regress/expected/incremental_sort.out     | 160 +++++++++++
 src/test/regress/sql/incremental_sort.sql     |  10 +
 5 files changed, 402 insertions(+), 171 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 14aedec919..8262c54e6a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2701,80 +2701,114 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
-/*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
- */
+
 static void
-show_incremental_sort_info(IncrementalSortState *incrsortstate,
-						   ExplainState *es)
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+						   const char *groupLabel, ExplainState *es)
 {
-	if (es->analyze && incrsortstate->sort_Done &&
-		incrsortstate->fullsort_state != NULL)
+	const char *sortMethodName;
+	const char *spaceTypeName;
+	ListCell *methodCell;
+	int methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		/* TODO: is it valid to get space used etc. only once given we re-use the sort? */
-		/* TODO: maybe show average, min, max sort group size? */
-
-		Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
-		TuplesortInstrumentation fullsort_stats;
-		const char *fullsort_sortMethod;
-		const char *fullsort_spaceType;
-		Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
-		TuplesortInstrumentation prefixsort_stats;
-		const char *prefixsort_sortMethod;
-		const char *prefixsort_spaceType;
-
-		tuplesort_get_stats(fullsort_state, &fullsort_stats);
-		fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
-		fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
-		if (prefixsort_state != NULL)
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
 		{
-			tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
-			prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
-			prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+			sortMethodName = tuplesort_method_name(lfirst_int(methodCell));
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
 		}
+		appendStringInfo(es->str, ")");
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
-							 fullsort_sortMethod, fullsort_spaceType,
-							 fullsort_stats.spaceUsed);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
-								 prefixsort_sortMethod, prefixsort_spaceType,
-								 prefixsort_stats.spaceUsed);
-			else
-				appendStringInfo(es->str, "\n");
-			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Groups: Full:  %ld",
-							 incrsortstate->fullsort_group_count);
-			if (prefixsort_state != NULL)
-				appendStringInfo(es->str, ", Prefix-only: %ld\n",
-							 incrsortstate->prefixsort_group_count);
-			else
-				appendStringInfo(es->str, "\n");
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
 		}
-		else
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
-			/* TODO */
-			ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-			ExplainPropertyInteger("Full Sort Space Used", "kB",
-					fullsort_stats.spaceUsed, es);
-			ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-			ExplainPropertyInteger("Full Sort Groups", NULL,
-								   incrsortstate->fullsort_group_count, es);
-
-			if (prefixsort_state != NULL)
-			{
-				ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-				ExplainPropertyInteger("Prefix Sort Space Used", "kB",
-						prefixsort_stats.spaceUsed, es);
-				ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-				ExplainPropertyInteger("Prefix Sort Groups", NULL,
-									   incrsortstate->prefixsort_group_count, es);
-			}
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			sortMethodName = tuplesort_method_name(lfirst_int(methodCell));
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+					groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
 		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
 	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
@@ -2785,79 +2819,36 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		{
 			IncrementalSortInfo *incsort_info =
 				&incrsortstate->shared_info->sinfo[n];
-			TuplesortInstrumentation *fullsort_instrument;
-			const char *fullsort_sortMethod;
-			const char *fullsort_spaceType;
-			long		fullsort_spaceUsed;
-			int64		fullsort_group_count;
-			TuplesortInstrumentation *prefixsort_instrument;
-			const char *prefixsort_sortMethod;
-			const char *prefixsort_spaceType;
-			long		prefixsort_spaceUsed;
-			int64		prefixsort_group_count;
-
-			fullsort_instrument = &incsort_info->fullsort_instrument;
-			fullsort_group_count = incsort_info->fullsort_group_count;
-
-			prefixsort_instrument = &incsort_info->prefixsort_instrument;
-			prefixsort_group_count = incsort_info->prefixsort_group_count;
-
-			if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
-				continue;		/* ignore any unfilled slots */
-
-			fullsort_sortMethod = tuplesort_method_name(
-					fullsort_instrument->sortMethod);
-			fullsort_spaceType = tuplesort_space_type_name(
-					fullsort_instrument->spaceType);
-			fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots").
+			 * I'm not convinced that makes sense: the same sort instrument
+			 * can be reused many times, so the most recent use still being
+			 * in progress doesn't tell us whether the slot was ever filled.
+			 * Instead we now check the group count in each group info; if
+			 * both are 0, we exclude the worker since it didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+					prefixsortGroupInfo->groupCount == 0)
+				continue;
 
-			if (prefixsort_instrument)
+			if (!opened_group)
 			{
-				prefixsort_sortMethod = tuplesort_method_name(
-						prefixsort_instrument->sortMethod);
-				prefixsort_spaceType = tuplesort_space_type_name(
-						prefixsort_instrument->spaceType);
-				prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
 			}
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfo(es->str,
-								 "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
-								 n, fullsort_sortMethod, fullsort_spaceType,
-								 fullsort_spaceUsed, fullsort_group_count);
-				if (prefixsort_instrument)
-					appendStringInfo(es->str,
-									 ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
-									 prefixsort_sortMethod, prefixsort_spaceType,
-									 prefixsort_spaceUsed, prefixsort_group_count);
-				else
-					appendStringInfo(es->str, "\n");
-			}
-			else
-			{
-				if (!opened_group)
-				{
-					ExplainOpenGroup("Workers", "Workers", false, es);
-					opened_group = true;
-				}
-				ExplainOpenGroup("Worker", NULL, true, es);
-				ExplainPropertyInteger("Worker Number", NULL, n, es);
-				ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
-				ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
-				ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
-				ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
-				if (prefixsort_instrument)
-				{
-					ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
-					ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
-					ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
-					ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
-				}
-				ExplainCloseGroup("Worker", NULL, true, es);
-			}
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
 		}
+
 		if (opened_group)
 			ExplainCloseGroup("Workers", "Workers", false, es);
 	}
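
For the text format, the avg/max figures come from simple per-group
accumulation.  A standalone illustration (not patch code; the sample sizes
are hypothetical) that reproduces the arithmetic behind output like
"Presorted Groups: 5 ... Memory: 26kB (avg), 30kB (max)":

#include <stdio.h>

int
main(void)
{
	long	spaceUsedKB[] = {28, 26, 30, 24, 22};	/* per-group space, kB */
	int		groupCount = 5;
	long	total = 0,
			max = 0;

	for (int i = 0; i < groupCount; i++)
	{
		total += spaceUsedKB[i];
		if (spaceUsedKB[i] > max)
			max = spaceUsedKB[i];
	}
	printf("%ldkB (avg), %ldkB (max)\n", total / groupCount, max);
	return 0;
}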
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index c3b903e568..e6f749a798 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -68,6 +68,47 @@
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
 
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+	Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation	sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+				sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+				&node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
 /*
  * Prepare information for presorted_keys comparison.
  */
@@ -199,8 +240,9 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
  * one different prefix key group before the large prefix key group.
  */
 static void
-switchToPresortedPrefixMode(IncrementalSortState *node)
+switchToPresortedPrefixMode(PlanState *pstate)
 {
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
 	ScanDirection		dir;
 	int64 nTuples = 0;
 	bool lastTuple = false;
@@ -355,7 +397,11 @@ switchToPresortedPrefixMode(IncrementalSortState *node)
 		 */
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+					&node->incsort_info.prefixsortGroupInfo,
+					node->prefixsort_state);
 
 		if (node->bounded)
 		{
@@ -479,7 +525,7 @@ ExecIncrementalSort(PlanState *pstate)
 			 */
 			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
 					node->n_fullsort_remaining);
-			switchToPresortedPrefixMode(node);
+			switchToPresortedPrefixMode(pstate);
 		}
 		else
 		{
@@ -602,7 +648,11 @@ ExecIncrementalSort(PlanState *pstate)
 
 				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+							&node->incsort_info.fullsortGroupInfo,
+							fullsort_state);
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -673,7 +723,12 @@ ExecIncrementalSort(PlanState *pstate)
 					 */
 					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 					tuplesort_performsort(fullsort_state);
-					node->fullsort_group_count++;
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+								&node->incsort_info.fullsortGroupInfo,
+								fullsort_state);
+
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
 					break;
@@ -705,7 +760,10 @@ ExecIncrementalSort(PlanState *pstate)
 				 */
 				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				node->fullsort_group_count++;
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+							&node->incsort_info.fullsortGroupInfo,
+							fullsort_state);
 
 				/*
 				 * If the full sort tuplesort happened to switch into top-n heapsort mode
@@ -735,7 +793,7 @@ ExecIncrementalSort(PlanState *pstate)
 				node->n_fullsort_remaining = nTuples;
 
 				/* Transition the tuples to the presorted prefix tuplesort. */
-				switchToPresortedPrefixMode(node);
+				switchToPresortedPrefixMode(pstate);
 
 				/*
 				 * Since we know we had tuples to move to the presorted prefix
@@ -801,7 +859,12 @@ ExecIncrementalSort(PlanState *pstate)
 		/* Perform the sort and return the tuples to the inner plan nodes. */
 		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
-		node->prefixsort_group_count++;
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+					&node->incsort_info.prefixsortGroupInfo,
+					node->prefixsort_state);
+
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
 
@@ -828,26 +891,6 @@ ExecIncrementalSort(PlanState *pstate)
 	 */
 	node->sort_Done = true;
 
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
-	{
-		IncrementalSortInfo *incsort_info =
-			&node->shared_info->sinfo[ParallelWorkerNumber];
-
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
-
-		tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
-		incsort_info->fullsort_group_count = node->fullsort_group_count;
-
-		if (node->prefixsort_state)
-		{
-			tuplesort_get_stats(node->prefixsort_state,
-					&incsort_info->prefixsort_instrument);
-			incsort_info->prefixsort_group_count = node->prefixsort_group_count;
-		}
-	}
-
 	/*
 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
 	 * tuples.
@@ -900,10 +943,28 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	incrsortstate->transfer_tuple = NULL;
 	incrsortstate->n_fullsort_remaining = 0;
 	incrsortstate->bound_Done = 0;
-	incrsortstate->fullsort_group_count = 0;
-	incrsortstate->prefixsort_group_count = 0;
 	incrsortstate->presorted_keys = NULL;
 
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+			&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+			&incrsortstate->incsort_info.prefixsortGroupInfo;
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
 	/*
 	 * Miscellaneous initialization
 	 *
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f905e384a2..0934482123 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2022,18 +2022,26 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
-/* ----------------
- *	 Shared memory container for per-worker incremental sort information
- * ----------------
- */
+typedef struct IncrementalSortGroupInfo
+{
+	int64 groupCount;
+	long maxDiskSpaceUsed;
+	long totalDiskSpaceUsed;
+	long maxMemorySpaceUsed;
+	long totalMemorySpaceUsed;
+	List *sortMethods;
+} IncrementalSortGroupInfo;
+
 typedef struct IncrementalSortInfo
 {
-	TuplesortInstrumentation	fullsort_instrument;
-	int64						fullsort_group_count;
-	TuplesortInstrumentation	prefixsort_instrument;
-	int64						prefixsort_group_count;
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
 } IncrementalSortInfo;
 
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
 typedef struct SharedIncrementalSortInfo
 {
 	int							num_workers;
@@ -2067,8 +2075,9 @@ typedef struct IncrementalSortState
 	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
 	/* the keys by which the input path is already sorted */
 	PresortedKeyData *presorted_keys;
-	int64		fullsort_group_count;	/* number of groups sorted on all keys */
-	int64		prefixsort_group_count;	/* number of groups sorted on suffix keys only */
+
+	IncrementalSortInfo incsort_info;
+
 	/* slot for pivot tuple defining values of presorted keys within group */
 	TupleTableSlot *group_pivot;
 	TupleTableSlot *transfer_tuple;
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 3a58efdf91..7892b111d7 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -436,6 +436,82 @@ select * from (select * from t order by a) s order by a, b limit 55;
  2 | 55
 (55 rows)
 
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
 delete from t;
 -- An initial small group followed by a large group.
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
@@ -526,6 +602,90 @@ select * from (select * from t order by a) s order by a, b limit 70;
  9 | 70
 (70 rows)
 
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 26kB (avg), 30kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 26,       +
+             "Maximum Sort Space Used": 30,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
 delete from t;
 -- Small groups of 10 tuples each tested around each mode transition point.
 insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index b9df37412f..9320a10b91 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -39,12 +39,22 @@ delete from t;
 insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
 select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
 delete from t;
 
 -- An initial small group followed by a large group.
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
 select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
 delete from t;
 
 -- Small groups of 10 tuples each tested around each mode transition point.
-- 
2.21.1

0004-Consider-incremental-sort-paths-in-addition-20200310.patchtext/plain; charset=us-asciiDownload
From 2b2b9acab041cfe5c345b528803c701aec0e2a91 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH 4/5] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 222 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 351 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..8d9c25e18f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,224 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of
+		 * an index scan all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just at pathkeys of the
+ * input paths (aiming to preserve the ordering). It also considers orderings
+ * that might be useful to nodes above the gather merge node, and tries to
+ * add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3117,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e9918ffcb4..84ed69ec5e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6424,7 +6424,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6483,6 +6485,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6807,7 +6883,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6842,6 +6920,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7223,7 +7351,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 77c03149cd..d778b884a9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.21.1

0005-A-couple-more-places-for-incremental-sort-20200310.patchtext/plain; charset=us-asciiDownload
From 424f7a261dd0e40ecd0df2552e8afdc3605d7fff Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH 5/5] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 216 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 84ed69ec5e..15223017c0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5070,6 +5070,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This probably duplicates the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell *lc;
+
+			foreach (lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted
+				 * in which case we want to consider incremental sort on top
+				 * of it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6570,12 +6631,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6606,6 +6673,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental sort
+				 * is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added a Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6875,6 +6992,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* add incremental sort */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7067,10 +7238,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7096,6 +7268,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach (lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7197,7 +7409,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.21.1

#195Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#194)
8 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

I gave this a very quick look; I don't claim to understand it or
anything, but I thought these trivial cleanups worthwhile. The only
non-cosmetic thing is changing the order of arguments to the SOn_printf()
calls in 0008; I think they are contrary to what the comment says.
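
(For instance, in switchToPresortedPrefixMode() the message text says
"Changing bound_Done from %ld to %ld", but the call seems to pass the
new value first:

	SO2_printf("Changing bound_Done from %ld to %ld\n",
			   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);

so "from" would print the new value and "to" the old one.)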

I don't propose to commit 0003 of course, since it's not our policy;
that's just to allow running pgindent sanely, which gives you 0004
(though my local pgindent has an unrelated fix). And after that you
notice the issue that 0005 fixes.

I did notice that show_incremental_sort_group_info() seems to be doing
things the hard way, or something. I got there because it throws this
warning:

/pgsql/source/master/src/backend/commands/explain.c: In function 'show_incremental_sort_group_info':
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing argument 2 of 'lappend' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
  methodNames = lappend(methodNames, sortMethodName);
                                     ^~~~~~~~~~~~~~
In file included from /pgsql/source/master/src/include/access/xact.h:20,
                 from /pgsql/source/master/src/backend/commands/explain.c:16:
/pgsql/source/master/src/include/nodes/pg_list.h:509:14: note: expected 'void *' but argument is of type 'const char *'
 extern List *lappend(List *list, void *datum);
              ^~~~~~~
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing 'const char *' to parameter of type 'void *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
  methodNames = lappend(methodNames, sortMethodName);
                                     ^~~~~~~~~~~~~~
/pgsql/source/master/src/include/nodes/pg_list.h:509:40: note: passing argument to parameter 'datum' here
 extern List *lappend(List *list, void *datum);
                                  ^
1 warning generated.

(Eh, it's funny that GCC reports two warnings about the same line, and
then says there's one warning.)

I suppose you could silence this by adding pstrdup(), and then use
list_free_deep (you have to put the sortMethodName declaration in the
inner scope for that, but seems fine). Or maybe there's a clever way
around it.
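
Something along these lines is what I had in mind -- untested, and I'm
guessing the loop shape from the code quoted in the warning:

	foreach(methodCell, groupInfo->sortMethods)
	{
		/* pstrdup so the list owns memory that list_free_deep can free */
		char	   *sortMethodName =
			pstrdup(tuplesort_method_name(lfirst_int(methodCell)));

		methodNames = lappend(methodNames, sortMethodName);
	}

	ExplainPropertyList("Sort Methods Used", methodNames, es);
	list_free_deep(methodNames);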

But I hesitate to send a patch for that, because the whole function is
written to handle text and the other output formats completely
separately -- yet looking for example at show_modifytable_info(), it
seems you can do ExplainOpenGroup, ExplainPropertyText,
ExplainPropertyList etc. in all explain output modes, and those routines
will take care of emitting the data in the correct format, without
having the show_incremental_sort_group_info function duplicate
everything.
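
Roughly something like this, I mean (just a sketch, not tested; note
that in text mode the generic Explain* routines print one "Key: value"
line per property, so the current compact one-line text output would
change shape):

	ExplainOpenGroup("Incremental Sort Groups", groupLabel, true, es);
	ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
	ExplainPropertyList("Sort Methods Used", methodNames, es);
	if (groupInfo->maxMemorySpaceUsed > 0)
	{
		long		avgSpace = groupInfo->totalMemorySpaceUsed /
			groupInfo->groupCount;

		ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
		ExplainPropertyInteger("Maximum Sort Space Used", "kB",
							   groupInfo->maxMemorySpaceUsed, es);
		ExplainPropertyText("Sort Space Type",
							tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY),
							es);
	}
	ExplainCloseGroup("Incremental Sort Groups", groupLabel, true, es);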

HTH. I would really like to get this patch done for pg13.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-fix-typo.patchtext/x-diff; charset=us-asciiDownload
From f0e70563197ffd04c2afc7f2221d489561267669 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 17:43:48 -0300
Subject: [PATCH 1/8] fix typo

---
 src/backend/executor/nodeIncrementalSort.c | 6 +++---
 src/include/executor/nodeIncrementalSort.h | 4 +---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index e6f749a798..44c6c17fc6 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  *
- * nodeIncremenalSort.c
+ * nodeIncrementalSort.c
  *	  Routines to handle incremental sorting of relations.
  *
  * DESCRIPTION
@@ -49,12 +49,12 @@
  *		it can start producing rows early, before sorting the whole dataset,
  *		which is a significant benefit especially for queries with LIMIT.
  *
- * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
- *	  src/backend/executor/nodeIncremenalSort.c
+ *	  src/backend/executor/nodeIncrementalSort.c
  *
  *-------------------------------------------------------------------------
  */
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
index 90d7a81711..3113989272 100644
--- a/src/include/executor/nodeIncrementalSort.h
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -2,9 +2,7 @@
  *
  * nodeIncrementalSort.h
  *
- *
- *
- * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * src/include/executor/nodeIncrementalSort.h
-- 
2.20.1

0002-fix-another-typo.patchtext/x-diff; charset=us-asciiDownload
From bd11b01a1322108194750ff81ee317e1e77fc048 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 18:26:05 -0300
Subject: [PATCH 2/8] fix another typo

---
 src/backend/executor/nodeIncrementalSort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 44c6c17fc6..e2cb9511ba 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -639,7 +639,7 @@ ExecIncrementalSort(PlanState *pstate)
 			slot = ExecProcNode(outerNode);
 
 			/*
-			 * When the outer node can't provide us anymore tuples, then we
+			 * When the outer node can't provide us any more tuples, then we
 			 * can sort the current group and return those tuples.
 			 */
 			if (TupIsNull(slot))
-- 
2.20.1

0003-typedefs-additions.patchtext/x-diff; charset=us-asciiDownload
From ad873efb35d7ea45b270739880ec2ee6d4eafefb Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 17:51:43 -0300
Subject: [PATCH 3/8] typedefs additions

---
 src/tools/pgindent/typedefs.list | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e216de9570..c34014660d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1,3 +1,12 @@
+ExplainWorkersState
+IncrementalSortState
+IncrementalSortGroupInfo
+IncrementalSort
+IncrementalSortInfo
+SharedIncrementalSortInfo
+IncrementalSortExecutionStatus
+IncrementalSortPath
+PresortedKeyData
 ABITVEC
 ACCESS_ALLOWED_ACE
 ACL
-- 
2.20.1

0004-pgindent.patchtext/x-diff; charset=us-asciiDownload
From d36a342906cb9002b9689c225b1412599c7b7aec Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 17:51:56 -0300
Subject: [PATCH 4/8] pgindent

---
 src/backend/commands/explain.c             |  49 +--
 src/backend/executor/execAmi.c             |   5 +-
 src/backend/executor/execProcnode.c        |   2 +-
 src/backend/executor/nodeIncrementalSort.c | 384 +++++++++++----------
 src/backend/nodes/copyfuncs.c              |   6 +-
 src/backend/nodes/outfuncs.c               |   4 +-
 src/backend/optimizer/path/allpaths.c      |  10 +-
 src/backend/optimizer/path/costsize.c      |  40 +--
 src/backend/optimizer/path/pathkeys.c      |  10 +-
 src/backend/optimizer/plan/createplan.c    |  48 +--
 src/backend/optimizer/plan/planner.c       |  30 +-
 src/backend/optimizer/plan/setrefs.c       |   2 +-
 src/backend/optimizer/util/pathnode.c      |  28 +-
 src/backend/utils/misc/guc.c               |   2 +-
 src/backend/utils/sort/tuplesort.c         |  26 +-
 src/include/executor/nodeIncrementalSort.h |   2 +-
 src/include/nodes/execnodes.h              |  34 +-
 src/include/optimizer/cost.h               |  14 +-
 src/include/optimizer/pathnode.h           |  10 +-
 src/include/optimizer/paths.h              |   2 +-
 20 files changed, 364 insertions(+), 344 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8262c54e6a..2d6bf75521 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -83,7 +83,7 @@ static void show_upper_qual(List *qual, const char *qlabel,
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
 static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
-					   List *ancestors, ExplainState *es);
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -714,7 +714,7 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
 	 * further down in the plan tree.
 	 */
 	ps = queryDesc->planstate;
-	if (IsA(ps, GatherState) &&((Gather *) ps->plan)->invisible)
+	if (IsA(ps, GatherState) && ((Gather *) ps->plan)->invisible)
 	{
 		ps = outerPlanState(ps);
 		es->hide_workers = true;
@@ -2211,7 +2211,7 @@ show_scan_qual(List *qual, const char *qlabel,
 {
 	bool		useprefix;
 
-	useprefix = (IsA(planstate->plan, SubqueryScan) ||es->verbose);
+	useprefix = (IsA(planstate->plan, SubqueryScan) || es->verbose);
 	show_qual(qual, qlabel, planstate, ancestors, useprefix, es);
 }
 
@@ -2704,12 +2704,12 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 
 static void
 show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
-						   const char *groupLabel, ExplainState *es)
+								 const char *groupLabel, ExplainState *es)
 {
 	const char *sortMethodName;
 	const char *spaceTypeName;
-	ListCell *methodCell;
-	int methodCount = list_length(groupInfo->sortMethods);
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
@@ -2727,7 +2727,8 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
-			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
 			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
 							 spaceTypeName, avgSpace,
@@ -2736,7 +2737,8 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
-			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
 			/* Add a semicolon separator only if memory stats were printed. */
 			if (groupInfo->maxMemorySpaceUsed > 0)
@@ -2750,7 +2752,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	}
 	else
 	{
-		List *methodNames = NIL;
+		List	   *methodNames = NIL;
 		StringInfoData groupName;
 
 		initStringInfo(&groupName);
@@ -2767,21 +2769,21 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
-			long avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
-					groupInfo->maxMemorySpaceUsed, es);
+								   groupInfo->maxMemorySpaceUsed, es);
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
 			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
 		}
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
-			long avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
-					groupInfo->maxDiskSpaceUsed, es);
+								   groupInfo->maxDiskSpaceUsed, es);
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
 			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
 		}
@@ -2818,23 +2820,24 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
 		{
 			IncrementalSortInfo *incsort_info =
-				&incrsortstate->shared_info->sinfo[n];
+			&incrsortstate->shared_info->sinfo[n];
+
 			/*
 			 * XXX: The previous version of the patch checked:
 			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
-			 * and continued if the condition was true (with the comment "ignore
-			 * any unfilled slots").
-			 * I'm not convinced that makes sense since the same sort instrument
-			 * can have been used multiple times, so the last time it being used
-			 * being still in progress, doesn't seem to be relevant.
-			 * Instead I'm now checking to see if the group count for each group
-			 * info is 0. If both are 0, then we exclude the worker since it
-			 * didn't contribute anything meaningful.
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense since the same sort instrument can have been used
+			 * multiple times, so the last time it being used being still in
+			 * progress, doesn't seem to be relevant. Instead I'm now checking
+			 * to see if the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
 			 */
 			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
 			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
 			if (fullsortGroupInfo->groupCount == 0 &&
-					prefixsortGroupInfo->groupCount == 0)
+				prefixsortGroupInfo->groupCount == 0)
 				continue;
 
 			if (!opened_group)
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index cba648a95e..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -566,9 +566,10 @@ ExecSupportsBackwardScan(Plan *node)
 			return true;
 
 		case T_IncrementalSort:
+
 			/*
-			 * Unlike full sort, incremental sort keeps only a single group
-			 * of tuples in memory, so it can't scan backwards.
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
 			 */
 			return false;
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 8051f46a71..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -859,7 +859,7 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 		 * good idea to integrate this signaling with the parameter-change
 		 * mechanism.
 		 */
-		IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
 
 		if (tuples_needed < 0)
 		{
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index e2cb9511ba..4f6b438e7b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -70,10 +70,10 @@
 
 static void
 instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
-	Tuplesortstate *sortState)
+					  Tuplesortstate *sortState)
 {
 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
-	TuplesortInstrumentation	sort_instr;
+	TuplesortInstrumentation sort_instr;
 
 	groupInfo->groupCount++;
 
@@ -96,7 +96,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 
 	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
 		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
-				sort_instr.sortMethod);
+											 sort_instr.sortMethod);
 
 	/* Record shared stats if we're a parallel worker. */
 	if (node->shared_info && node->am_worker)
@@ -105,7 +105,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
 
 		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
-				&node->incsort_info, sizeof(IncrementalSortInfo));
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
 	}
 }
 
@@ -115,31 +115,31 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 static void
 preparePresortedCols(IncrementalSortState *node)
 {
-	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
-	int					presortedCols,
-						i;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	int			presortedCols,
+				i;
 
 	Assert(IsA(plannode, IncrementalSort));
 	presortedCols = plannode->presortedCols;
 
 	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
-													sizeof(PresortedKeyData));
+													   sizeof(PresortedKeyData));
 
 	/* Pre-cache comparison functions for each pre-sorted key. */
 	for (i = 0; i < presortedCols; i++)
 	{
-		Oid					equalityOp,
-							equalityFunc;
-		PresortedKeyData   *key;
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
 
 		key = &node->presorted_keys[i];
 		key->attno = plannode->sort.sortColIdx[i];
 
 		equalityOp = get_equality_op_for_ordering_op(
-										plannode->sort.sortOperators[i], NULL);
+													 plannode->sort.sortOperators[i], NULL);
 		if (!OidIsValid(equalityOp))
 			elog(ERROR, "missing equality operator for ordering operator %u",
-					plannode->sort.sortOperators[i]);
+				 plannode->sort.sortOperators[i]);
 
 		equalityFunc = get_opcode(equalityOp);
 		if (!OidIsValid(equalityFunc))
@@ -151,7 +151,7 @@ preparePresortedCols(IncrementalSortState *node)
 		/* We can initialize the callinfo just once and re-use it */
 		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
 		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
-								plannode->sort.collations[i], NULL, NULL);
+								 plannode->sort.collations[i], NULL, NULL);
 		key->fcinfo->args[0].isnull = false;
 		key->fcinfo->args[1].isnull = false;
 	}
@@ -167,27 +167,28 @@ preparePresortedCols(IncrementalSortState *node)
 static bool
 isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
 {
-	int presortedCols, i;
+	int			presortedCols,
+				i;
 
 	Assert(IsA(node->ss.ps.plan, IncrementalSort));
 
 	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
 
 	/*
-	 * That the input is sorted by keys (0, ... n) implies that the tail keys
-	 * are more likely to change. Therefore we do our comparison starting from
-	 * the last pre-sorted column to optimize for early detection of
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
 	 * inequality and minimizing the number of function calls.
 	 */
 	for (i = presortedCols - 1; i >= 0; i--)
 	{
-		Datum				datumA,
-							datumB,
-							result;
-		bool				isnullA,
-							isnullB;
-		AttrNumber			attno = node->presorted_keys[i].attno;
-		PresortedKeyData   *key;
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
 
 		datumA = slot_getattr(pivot, attno, &isnullA);
 		datumB = slot_getattr(tuple, attno, &isnullB);
@@ -243,13 +244,13 @@ static void
 switchToPresortedPrefixMode(PlanState *pstate)
 {
 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
-	ScanDirection		dir;
-	int64 nTuples = 0;
-	bool lastTuple = false;
-	bool firstTuple = true;
-	TupleDesc		    tupDesc;
-	PlanState		   *outerNode;
-	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
 
 	dir = node->ss.ps.state->es_direction;
 	outerNode = outerPlanState(node);
@@ -258,22 +259,22 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	if (node->prefixsort_state == NULL)
 	{
 		Tuplesortstate *prefixsort_state;
-		int presortedCols = plannode->presortedCols;
+		int			presortedCols = plannode->presortedCols;
 
 		/*
-		 * Optimize the sort by assuming the prefix columns are all equal
-		 * and thus we only need to sort by any remaining columns.
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
 		 */
 		prefixsort_state = tuplesort_begin_heap(
-				tupDesc,
-				plannode->sort.numCols - presortedCols,
-				&(plannode->sort.sortColIdx[presortedCols]),
-				&(plannode->sort.sortOperators[presortedCols]),
-				&(plannode->sort.collations[presortedCols]),
-				&(plannode->sort.nullsFirst[presortedCols]),
-				work_mem,
-				NULL,
-				false);
+												tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
 		node->prefixsort_state = prefixsort_state;
 	}
 	else
@@ -284,15 +285,15 @@ switchToPresortedPrefixMode(PlanState *pstate)
 
 	/*
 	 * If the current node has a bound, then it's reasonably likely that a
-	 * large prefix key group will benefit from bounded sort, so configure
-	 * the tuplesort to allow for that optimization.
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
 	 */
 	if (node->bounded)
 	{
 		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
-				node->bound - node->bound_Done);
+				   node->bound - node->bound_Done);
 		tuplesort_set_bound(node->prefixsort_state,
-				node->bound - node->bound_Done);
+							node->bound - node->bound_Done);
 	}
 
 	for (;;)
@@ -315,12 +316,12 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		else
 		{
 			tuplesort_gettupleslot(node->fullsort_state,
-					ScanDirectionIsForward(dir),
-					false, node->transfer_tuple, NULL);
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
 
 			/*
-			 * If this is our first time through the loop, then we need to save the
-			 * first tuple we get as our new group pivot.
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
 			 */
 			if (TupIsNull(node->group_pivot))
 				ExecCopySlot(node->group_pivot, node->transfer_tuple);
@@ -332,16 +333,18 @@ switchToPresortedPrefixMode(PlanState *pstate)
 			}
 			else
 			{
-				/* The tuple isn't part of the current batch so we need to carry
-				 * it over into the next set up tuples we transfer out of the full
-				 * sort tuplesort into the presorted prefix tuplesort. We don't
-				 * actually have to do anything special to save the tuple since
-				 * we've already loaded it into the node->transfer_tuple slot, and,
-				 * even though that slot points to memory inside the full sort
-				 * tuplesort, we can't reset that tuplesort anyway until we've
-				 * fully transferred out of its tuples, so this reference is safe.
-				 * We do need to reset the group pivot tuple though since we've
-				 * finished the current prefix key group.
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set up tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
 				 */
 				ExecClearTuple(node->group_pivot);
 				break;
@@ -351,6 +354,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		firstTuple = false;
 
 		if (lastTuple)
+
 			/*
 			 * We retain the current group pivot tuple since we haven't yet
 			 * found the end of the current prefix key group.
@@ -371,9 +375,9 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	if (lastTuple)
 	{
 		/*
-		 * We've confirmed that all tuples remaining in the full sort batch
-		 * are in the same prefix key group and moved all of those tuples into
-		 * the presorted prefix tuplesort. Now we can save our pivot comparison
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
 		 * tuple and continue fetching tuples from the outer execution node to
 		 * load into the presorted prefix tuplesort.
 		 */
@@ -381,9 +385,10 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
 		node->execution_status = INCSORT_LOADPREFIXSORT;
 
-		/* Make sure we clear the transfer tuple slot so that next time we
-		 * encounter a large prefix key group we don't incorrectly assume
-		 * we have a tuple carried over from the previous group.
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
 		 */
 		ExecClearTuple(node->transfer_tuple);
 	}
@@ -391,28 +396,28 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	{
 		/*
 		 * We finished a group but didn't consume all of the tuples from the
-		 * full sort batch sorter, so we'll sort this batch, let the inner node
-		 * read out all of those tuples, and then come back around to find
-		 * another batch.
+		 * full sort batch sorter, so we'll sort this batch, let the inner
+		 * node read out all of those tuples, and then come back around to
+		 * find another batch.
 		 */
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
 
 		if (pstate->instrument != NULL)
 			instrumentSortedGroup(pstate,
-					&node->incsort_info.prefixsortGroupInfo,
-					node->prefixsort_state);
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
 
 		if (node->bounded)
 		{
 			/*
 			 * If the current node has a bound, and we've already sorted n
-			 * tuples, then the functional bound remaining is
-			 * (original bound - n), so store the current number of processed
-			 * tuples for use in configuring sorting bound.
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
 			 */
 			SO2_printf("Changing bound_Done from %ld to %ld\n",
-					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 		}
 
@@ -477,16 +482,16 @@ static TupleTableSlot *
 ExecIncrementalSort(PlanState *pstate)
 {
 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
-	EState			   *estate;
-	ScanDirection		dir;
-	Tuplesortstate	   *read_sortstate;
-	Tuplesortstate	   *fullsort_state;
-	TupleTableSlot	   *slot;
-	IncrementalSort	   *plannode = (IncrementalSort *) node->ss.ps.plan;
-	PlanState		   *outerNode;
-	TupleDesc			tupDesc;
-	int64				nTuples = 0;
-	int64				minGroupSize;
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -495,7 +500,7 @@ ExecIncrementalSort(PlanState *pstate)
 	fullsort_state = node->fullsort_state;
 
 	if (node->execution_status == INCSORT_READFULLSORT
-			|| node->execution_status == INCSORT_READPREFIXSORT)
+		|| node->execution_status == INCSORT_READPREFIXSORT)
 	{
 		/*
 		 * Return next tuple from the current sorted group set if available.
@@ -505,35 +510,37 @@ ExecIncrementalSort(PlanState *pstate)
 		slot = node->ss.ps.ps_ResultTupleSlot;
 		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
 								   false, slot, NULL) || node->finished)
+
 			/*
-			 * TODO: there isn't a good test case for the node->finished
-			 * case directly, but lots of other stuff fails if it's not
-			 * there. If the outer node will fail when trying to fetch
-			 * too many tuples, then things break if that test isn't here.
+			 * TODO: there isn't a good test case for the node->finished case
+			 * directly, but lots of other stuff fails if it's not there. If
+			 * the outer node will fail when trying to fetch too many tuples,
+			 * then things break if that test isn't here.
 			 */
 			return slot;
 		else if (node->n_fullsort_remaining > 0)
 		{
 			/*
 			 * When we transition to presorted prefix mode, we might have
-			 * accumulated at least one additional prefix key group in the full
-			 * sort tuplesort. The first call to switchToPresortedPrefixMode()
-			 * pulled the one of those groups out, and we've returned those
-			 * tuples to the inner node, but if we tuples remaining in that
-			 * tuplesort (i.e., n_fullsort_remaining > 0) at this point we
-			 * need to do that again.
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out,
+			 * and we've returned those tuples to the inner node, but if we
+			 * have tuples remaining in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
 			 */
 			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
-					node->n_fullsort_remaining);
+					   node->n_fullsort_remaining);
 			switchToPresortedPrefixMode(pstate);
 		}
 		else
 		{
 			/*
-			 * If we don't have any already sorted tuples to read, and we're not
-			 * in the middle of transitioning into presorted prefix sort mode,
-			 * then it's time to start the process all over again by building
-			 * new full sort group.
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
 			 */
 			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
 			node->execution_status = INCSORT_LOADFULLSORT;
@@ -541,8 +548,8 @@ ExecIncrementalSort(PlanState *pstate)
 	}
 
 	/*
-	 * Want to scan subplan in the forward direction while creating the
-	 * sorted data.
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
 	 */
 	estate->es_direction = ForwardScanDirection;
 
@@ -570,15 +577,15 @@ ExecIncrementalSort(PlanState *pstate)
 			 * columns.
 			 */
 			fullsort_state = tuplesort_begin_heap(
-					tupDesc,
-					plannode->sort.numCols,
-					plannode->sort.sortColIdx,
-					plannode->sort.sortOperators,
-					plannode->sort.collations,
-					plannode->sort.nullsFirst,
-					work_mem,
-					NULL,
-					false);
+												  tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
 			node->fullsort_state = fullsort_state;
 		}
 		else
@@ -593,13 +600,13 @@ ExecIncrementalSort(PlanState *pstate)
 		 */
 		if (node->bounded)
 		{
-			int64 currentBound = node->bound - node->bound_Done;
+			int64		currentBound = node->bound - node->bound_Done;
 
 			/*
 			 * Bounded sort isn't likely to be a useful optimization for full
 			 * sort mode since we limit full sort mode to a relatively small
-			 * number of tuples and tuplesort doesn't switch over to top-n heap
-			 * sort anyway unless it hits (2 * bound) tuples.
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
 			 */
 			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
 				tuplesort_set_bound(fullsort_state, currentBound);
@@ -609,9 +616,11 @@ ExecIncrementalSort(PlanState *pstate)
 		else
 			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
 
-		/* Because we have to read the next tuple to find out that we've
+		/*
+		 * Because we have to read the next tuple to find out that we've
 		 * encountered a new prefix key group on subsequent groups we have to
-		 * carry over that extra tuple and add it to the new group's sort here.
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here.
 		 */
 		if (!TupIsNull(node->group_pivot))
 		{
@@ -620,10 +629,10 @@ ExecIncrementalSort(PlanState *pstate)
 
 			/*
 			 * We're in full sort mode accumulating a minimum number of tuples
-			 * and not checking for prefix key equality yet, so we can't assume
-			 * the group pivot tuple will reamin the same -- unless we're using
-			 * a minimum group size of 1, in which case the pivot is obviously
-			 * still the pviot.
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
 			 */
 			if (nTuples != minGroupSize)
 				ExecClearTuple(node->group_pivot);
@@ -651,8 +660,8 @@ ExecIncrementalSort(PlanState *pstate)
 
 				if (pstate->instrument != NULL)
 					instrumentSortedGroup(pstate,
-							&node->incsort_info.fullsortGroupInfo,
-							fullsort_state);
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -663,9 +672,10 @@ ExecIncrementalSort(PlanState *pstate)
 			if (nTuples < minGroupSize)
 			{
 				/*
-				 * If we have yet hit our target minimum group size, then don't
-				 * both with checking for inclusion in the current prefix group
-				 * since a large number of very tiny sorts is inefficient.
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother with checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
 				 */
 				tuplesort_puttupleslot(fullsort_state, slot);
 				nTuples++;
@@ -695,11 +705,11 @@ ExecIncrementalSort(PlanState *pstate)
 				{
 					/*
 					 * Since the tuple we fetched isn't part of the current
-					 * prefix key group we can't sort it as part of this
-					 * sort group. Instead we need to carry it over to the
-					 * next group. We use the group_pivot slot as a temp
-					 * container for that purpose even though we won't actually
-					 * treat it as a group pivot.
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
 					 */
 					ExecCopySlot(node->group_pivot, slot);
 
@@ -707,13 +717,13 @@ ExecIncrementalSort(PlanState *pstate)
 					{
 						/*
 						 * If the current node has a bound, and we've already
-						 * sorted n tuples, then the functional bound remaining
-						 * is (original bound - n), so store the current number
-						 * of processed tuples for use in configuring sorting
-						 * bound.
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
 						 */
 						SO2_printf("Changing bound_Done from %ld to %ld\n",
-								Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+								   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
 						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 					}
 
@@ -726,8 +736,8 @@ ExecIncrementalSort(PlanState *pstate)
 
 					if (pstate->instrument != NULL)
 						instrumentSortedGroup(pstate,
-								&node->incsort_info.fullsortGroupInfo,
-								fullsort_state);
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
 
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
@@ -737,17 +747,17 @@ ExecIncrementalSort(PlanState *pstate)
 
 			/*
 			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
-			 * then we make the assumption that it's likely that we've found
-			 * a large group of tuples having a single prefix key (as long
-			 * as the last tuple didn't shift us into reading from the full
-			 * sort mode tuplesort).
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
 			 */
 			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
-					node->execution_status != INCSORT_READFULLSORT)
+				node->execution_status != INCSORT_READFULLSORT)
 			{
 				/*
-				 * The group pivot we have stored has already been put into the
-				 * tuplesort; we don't want to carry it over.
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
 				 */
 				ExecClearTuple(node->group_pivot);
 
@@ -762,33 +772,36 @@ ExecIncrementalSort(PlanState *pstate)
 				tuplesort_performsort(fullsort_state);
 				if (pstate->instrument != NULL)
 					instrumentSortedGroup(pstate,
-							&node->incsort_info.fullsortGroupInfo,
-							fullsort_state);
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
 
 				/*
-				 * If the full sort tuplesort happened to switch into top-n heapsort mode
-				 * then we will only be able to retrieve currentBound tuples (since the
-				 * tuplesort will have only retained the top-n tuples). This is safe even
-				 * though we haven't yet completed fetching the current prefix key group
-				 * because the tuples we've "lost" already sorted "below" the retained ones,
-				 * and we're already contractually guaranteed to not need any more than the
-				 * currentBount tuples.
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
 				 */
 				if (tuplesort_used_bound(node->fullsort_state))
 				{
-					int64 currentBound = node->bound - node->bound_Done;
+					int64		currentBound = node->bound - node->bound_Done;
+
 					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
-							nTuples, Min(currentBound, nTuples));
+							   nTuples, Min(currentBound, nTuples));
 					nTuples = Min(currentBound, nTuples);
 				}
 
 				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
-						nTuples);
+						   nTuples);
 
 				/*
-				 * Track the number of tuples we need to move from the fullsort
-				 * to presorted prefix sort (we might have multiple prefix key
-				 * groups, so we need a way to see if we've actually finished).
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
 				 */
 				node->n_fullsort_remaining = nTuples;
 
@@ -800,11 +813,12 @@ ExecIncrementalSort(PlanState *pstate)
 				 * tuplesort, we know that unless that transition has verified
 				 * that all tuples belonged to the same prefix key group (in
 				 * which case we can go straight to continuing to load tuples
-				 * into that tuplesort), we should have a tuple to return here.
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
 				 *
 				 * Either way, the appropriate execution status should have
-				 * been set by switchToPresortedPrefixMode(), so we can drop out
-				 * of the loop here and let the appropriate path kick in.
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
 				 */
 				break;
 			}
@@ -815,9 +829,9 @@ ExecIncrementalSort(PlanState *pstate)
 	{
 		/*
 		 * Since we only enter this state after determining that all remaining
-		 * tuples in the full sort tuplesort have the same prefix, we've already
-		 * established a current group pivot tuple (but wasn't carried over;
-		 * it's already been put into the prefix sort tuplesort).
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (but it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
 		 */
 		Assert(!TupIsNull(node->group_pivot));
 
@@ -835,9 +849,10 @@ ExecIncrementalSort(PlanState *pstate)
 			if (isCurrentGroup(node, node->group_pivot, slot))
 			{
 				/*
-				 * Fetch tuples and put them into the presorted prefix tuplesort
-				 * until we find changed prefix keys. Only then can we guarantee
-				 * sort stability of the tuples we've already accumulated.
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
 				 */
 				tuplesort_puttupleslot(node->prefixsort_state, slot);
 				nTuples++;
@@ -862,8 +877,8 @@ ExecIncrementalSort(PlanState *pstate)
 
 		if (pstate->instrument != NULL)
 			instrumentSortedGroup(pstate,
-					&node->incsort_info.prefixsortGroupInfo,
-					node->prefixsort_state);
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
 
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
@@ -872,12 +887,12 @@ ExecIncrementalSort(PlanState *pstate)
 		{
 			/*
 			 * If the current node has a bound, and we've already sorted n
-			 * tuples, then the functional bound remaining is
-			 * (original bound - n), so store the current number of processed
-			 * tuples for use in configuring sorting bound.
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
 			 */
 			SO2_printf("Changing bound_Done from %ld to %ld\n",
-					Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 		}
 	}
@@ -913,7 +928,7 @@ ExecIncrementalSort(PlanState *pstate)
 IncrementalSortState *
 ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 {
-	IncrementalSortState   *incrsortstate;
+	IncrementalSortState *incrsortstate;
 
 	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
 
@@ -948,9 +963,10 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	if (incrsortstate->ss.ps.instrument != NULL)
 	{
 		IncrementalSortGroupInfo *fullsortGroupInfo =
-			&incrsortstate->incsort_info.fullsortGroupInfo;
+		&incrsortstate->incsort_info.fullsortGroupInfo;
 		IncrementalSortGroupInfo *prefixsortGroupInfo =
-			&incrsortstate->incsort_info.prefixsortGroupInfo;
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
 		fullsortGroupInfo->groupCount = 0;
 		fullsortGroupInfo->maxDiskSpaceUsed = 0;
 		fullsortGroupInfo->totalDiskSpaceUsed = 0;
@@ -988,17 +1004,17 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
 
 	/*
-	 * Initialize return slot and type. No need to initialize projection info because
-	 * this node doesn't do projections.
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
 	 */
 	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
 
 	/* make standalone slot to store previous tuple from outer node */
 	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
-							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+														  ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
 	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
-							ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+															 ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
 
 	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d2b9bd95ba..12d5d4523d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -968,7 +968,7 @@ _copySort(const Sort *from)
 static IncrementalSort *
 _copyIncrementalSort(const IncrementalSort *from)
 {
-	IncrementalSort	   *newnode = makeNode(IncrementalSort);
+	IncrementalSort *newnode = makeNode(IncrementalSort);
 
 	/*
 	 * copy node superclass fields
@@ -3541,7 +3541,7 @@ _copyCreateStatsStmt(const CreateStatsStmt *from)
 }
 
 static AlterStatsStmt *
-_copyAlterStatsStmt(const AlterStatsStmt *from)
+_copyAlterStatsStmt(const AlterStatsStmt * from)
 {
 	AlterStatsStmt *newnode = makeNode(AlterStatsStmt);
 
@@ -3671,7 +3671,7 @@ _copyAlterOperatorStmt(const AlterOperatorStmt *from)
 }
 
 static AlterTypeStmt *
-_copyAlterTypeStmt(const AlterTypeStmt *from)
+_copyAlterTypeStmt(const AlterTypeStmt * from)
 {
 	AlterTypeStmt *newnode = makeNode(AlterTypeStmt);
 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6c83372c9f..5620b24ac5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2694,7 +2694,7 @@ _outCreateStatsStmt(StringInfo str, const CreateStatsStmt *node)
 }
 
 static void
-_outAlterStatsStmt(StringInfo str, const AlterStatsStmt *node)
+_outAlterStatsStmt(StringInfo str, const AlterStatsStmt * node)
 {
 	WRITE_NODE_TYPE("ALTERSTATSSTMT");
 
@@ -3670,7 +3670,7 @@ outNode(StringInfo str, const void *obj)
 
 	if (obj == NULL)
 		appendStringInfoString(str, "<>");
-	else if (IsA(obj, List) ||IsA(obj, IntList) || IsA(obj, OidList))
+	else if (IsA(obj, List) || IsA(obj, IntList) || IsA(obj, OidList))
 		_outList(str, obj);
 	else if (IsA(obj, Integer) ||
 			 IsA(obj, Float) ||
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8d9c25e18f..e0bb71dd51 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2804,9 +2804,9 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 		}
 
 		/*
-		 * This ends up allowing us to do incremental sort on top of
-		 * an index scan all parallelized under a gather merge node.
-		*/
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan all parallelized under a gather merge node.
+		 */
 		if (query_pathkeys_ok)
 			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
 	}
@@ -2897,7 +2897,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			 */
 			if (cheapest_partial_path == subpath)
 			{
-				Path *tmp;
+				Path	   *tmp;
 
 				tmp = (Path *) create_sort_path(root,
 												rel,
@@ -2922,7 +2922,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			/* finally, consider incremental sort */
 			if (presorted_keys > 0)
 			{
-				Path *tmp;
+				Path	   *tmp;
 
 				/* Also consider incremental sort. */
 				tmp = (Path *) create_incremental_sort_path(root,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d1748d1011..152e016228 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1683,9 +1683,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  */
 static void
 cost_tuplesort(Cost *startup_cost, Cost *run_cost,
-		  double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
@@ -1810,10 +1810,10 @@ cost_full_sort(Cost *startup_cost, Cost *run_cost,
  */
 void
 cost_incremental_sort(Path *path,
-		  PlannerInfo *root, List *pathkeys, int presorted_keys,
-		  Cost input_startup_cost, Cost input_total_cost,
-		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
 {
 	Cost		startup_cost = 0,
 				run_cost = 0,
@@ -1839,9 +1839,9 @@ cost_incremental_sort(Path *path,
 	/* Extract presorted keys as list of expressions */
 	foreach(l, pathkeys)
 	{
-		PathKey *key = (PathKey *)lfirst(l);
+		PathKey    *key = (PathKey *) lfirst(l);
 		EquivalenceMember *member = (EquivalenceMember *)
-						linitial(key->pk_eclass->ec_members);
+		linitial(key->pk_eclass->ec_members);
 
 		presortedExprs = lappend(presortedExprs, member->em_expr);
 
@@ -1856,11 +1856,11 @@ cost_incremental_sort(Path *path,
 	group_input_run_cost = input_run_cost / input_groups;
 
 	/*
-	 * Estimate average cost of sorting of one group where presorted keys
-	 * are equal.  Incremental sort is sensitive to distribution of tuples
-	 * to the groups, where we're relying on quite rough assumptions.  Thus,
-	 * we're pessimistic about incremental sort performance and increase
-	 * its average group size by half.
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and increase its average
+	 * group size by half.
 	 */
 	cost_tuplesort(&group_startup_cost, &group_run_cost,
 				   1.5 * group_tuples, width, comparison_cost, sort_mem,
@@ -1875,9 +1875,9 @@ cost_incremental_sort(Path *path,
 
 	/*
 	 * After we started producing tuples from the first group, the cost of
-	 * producing all the tuples is given by the cost to finish processing
-	 * this group, plus the total cost to process the remaining groups,
-	 * plus the remaining cost of input.
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
 	 */
 	run_cost += group_run_cost
 		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
@@ -1916,8 +1916,8 @@ cost_sort(Path *path, PlannerInfo *root,
 		  double limit_tuples)
 
 {
-	Cost startup_cost;
-	Cost run_cost;
+	Cost		startup_cost;
+	Cost		run_cost;
 
 	cost_full_sort(&startup_cost, &run_cost,
 				   input_cost,
@@ -3173,7 +3173,7 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * The whole issue is moot if we are working from a unique-ified outer
 	 * input, or if we know we don't need to mark/restore at all.
 	 */
-	if (IsA(outer_path, UniquePath) ||path->skip_mark_restore)
+	if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
 		rescannedtuples = 0;
 	else
 	{
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 6e2ba08d7b..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -372,7 +372,7 @@ pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
 int
 pathkeys_common(List *keys1, List *keys2)
 {
-	int		n;
+	int			n;
 
 	(void) pathkeys_common_contained_in(keys1, keys2, &n);
 	return n;
@@ -1838,7 +1838,7 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
-	int	n_common_pathkeys;
+	int			n_common_pathkeys;
 
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
@@ -1850,9 +1850,9 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 										&n_common_pathkeys);
 
 	/*
-	 * Return the number of path keys in common, or 0 if there are none.
-	 * Any leading common pathkeys could be useful for ordering because
-	 * we can use the incremental sort.
+	 * Return the number of path keys in common, or 0 if there are none. Any
+	 * leading common pathkeys could be useful for ordering because we can use
+	 * the incremental sort.
 	 */
 	return n_common_pathkeys;
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 53d08aed2e..026a60b946 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -99,7 +99,7 @@ static Plan *create_projection_plan(PlannerInfo *root,
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
 static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
-									IncrementalSortPath *best_path, int flags);
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -247,9 +247,9 @@ static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
 static IncrementalSort *make_incrementalsort(Plan *lefttree,
-		  int numCols, int presortedCols,
-		  AttrNumber *sortColIdx, Oid *sortOperators,
-		  Oid *collations, bool *nullsFirst);
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -265,7 +265,7 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
 static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
-						List *pathkeys, Relids relids, int presortedCols);
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -2016,17 +2016,17 @@ static IncrementalSort *
 create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 							int flags)
 {
-	IncrementalSort	   *plan;
-	Plan			   *subplan;
+	IncrementalSort *plan;
+	Plan	   *subplan;
 
 	/* See comments in create_sort_plan() above */
 	subplan = create_plan_recurse(root, best_path->spath.subpath,
 								  flags | CP_SMALL_TLIST);
 	plan = make_incrementalsort_from_pathkeys(subplan,
-								best_path->spath.path.pathkeys,
-								IS_OTHER_REL(best_path->spath.subpath->parent) ?
-								best_path->spath.path.parent->relids : NULL,
-								best_path->presortedCols);
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
@@ -5131,18 +5131,18 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 				run_cost;
 
 	/*
-	 * This function shouldn't have to deal with IncrementalSort plans
-	 * because they are only created from corresponding Path nodes.
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
 	 */
 	Assert(IsA(plan, Sort));
 
 	cost_full_sort(&startup_cost, &run_cost,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
 	plan->plan.startup_cost = startup_cost;
 	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
@@ -5750,11 +5750,11 @@ make_sort(Plan *lefttree, int numCols,
  */
 static IncrementalSort *
 make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
-		  AttrNumber *sortColIdx, Oid *sortOperators,
-		  Oid *collations, bool *nullsFirst)
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
 {
-	IncrementalSort	   *node;
-	Plan			   *plan;
+	IncrementalSort *node;
+	Plan	   *plan;
 
 	node = makeNode(IncrementalSort);
 
@@ -6130,7 +6130,7 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
  */
 static IncrementalSort *
 make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
-						Relids relids, int presortedCols)
+								   Relids relids, int presortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 15223017c0..330a6a2f6c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4868,7 +4868,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4986,8 +4986,8 @@ create_ordered_paths(PlannerInfo *root,
 			if (input_path == cheapest_input_path)
 			{
 				/*
-				 * Sort the cheapest input path. An explicit sort here can take
-				 * advantage of LIMIT.
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
 				 */
 				sorted_path = (Path *) create_sort_path(root,
 														ordered_rel,
@@ -5079,9 +5079,9 @@ create_ordered_paths(PlannerInfo *root,
 		 */
 		if (enable_incrementalsort)
 		{
-			ListCell *lc;
+			ListCell   *lc;
 
-			foreach (lc, input_rel->partial_pathlist)
+			foreach(lc, input_rel->partial_pathlist)
 			{
 				Path	   *input_path = (Path *) lfirst(lc);
 				Path	   *sorted_path = input_path;
@@ -5091,9 +5091,9 @@ create_ordered_paths(PlannerInfo *root,
 
 				/*
 				 * We don't care if this is the cheapest partial path - we
-				 * can't simply skip it, because it may be partially sorted
-				 * in which case we want to consider incremental sort on top
-				 * of it (instead of full sort, which is what happens above).
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
 				 */
 
 				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
@@ -6586,8 +6586,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			else if (parse->hasAggs)
 			{
 				/*
-				 * We have aggregation, possibly with plain GROUP BY. Make
-				 * an AggPath.
+				 * We have aggregation, possibly with plain GROUP BY. Make an
+				 * AggPath.
 				 */
 				add_path(grouped_rel, (Path *)
 						 create_agg_path(root,
@@ -6604,8 +6604,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			else if (parse->groupClause)
 			{
 				/*
-				 * We have GROUP BY without aggregation or grouping sets.
-				 * Make a GroupPath.
+				 * We have GROUP BY without aggregation or grouping sets. Make
+				 * a GroupPath.
 				 */
 				add_path(grouped_rel, (Path *)
 						 create_group_path(root,
@@ -6676,8 +6676,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 
 				/*
 				 * Now we may consider incremental sort on this path, but only
-				 * when the path is not already sorted and when incremental sort
-				 * is enabled.
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
 				 */
 				if (is_sorted || !enable_incrementalsort)
 					continue;
@@ -7273,7 +7273,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 		return;
 
 	/* also consider incremental sort on partial paths, if enabled */
-	foreach (lc, rel->partial_pathlist)
+	foreach(lc, rel->partial_pathlist)
 	{
 		Path	   *path = (Path *) lfirst(lc);
 		bool		is_sorted;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 2b676bf406..baefe0e946 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1960,7 +1960,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 static void
 set_param_references(PlannerInfo *root, Plan *plan)
 {
-	Assert(IsA(plan, Gather) ||IsA(plan, GatherMerge));
+	Assert(IsA(plan, Gather) || IsA(plan, GatherMerge));
 
 	if (plan->lefttree->extParam)
 	{
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 88402a9033..35773cc2c7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2627,7 +2627,7 @@ apply_projection_to_path(PlannerInfo *root,
 	 * workers can help project.  But if there is something that is not
 	 * parallel-safe in the target expressions, then we can't.
 	 */
-	if ((IsA(path, GatherPath) ||IsA(path, GatherMergePath)) &&
+	if ((IsA(path, GatherPath) || IsA(path, GatherMergePath)) &&
 		is_parallel_safe(root, (Node *) target->exprs))
 	{
 		/*
@@ -2755,14 +2755,14 @@ create_set_projection_path(PlannerInfo *root,
  */
 SortPath *
 create_incremental_sort_path(PlannerInfo *root,
-				 RelOptInfo *rel,
-				 Path *subpath,
-				 List *pathkeys,
-				 int presorted_keys,
-				 double limit_tuples)
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
 {
 	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
-	SortPath *pathnode = &sort->spath;
+	SortPath   *pathnode = &sort->spath;
 
 	pathnode->path.pathtype = T_IncrementalSort;
 	pathnode->path.parent = rel;
@@ -2779,13 +2779,13 @@ create_incremental_sort_path(PlannerInfo *root,
 	pathnode->subpath = subpath;
 
 	cost_incremental_sort(&pathnode->path,
-			  root, pathkeys, presorted_keys,
-			  subpath->startup_cost,
-			  subpath->total_cost,
-			  subpath->rows,
-			  subpath->pathtarget->width,
-			  0.0,				/* XXX comparison_cost shouldn't be 0? */
-			  work_mem, limit_tuples);
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
 
 	sort->presortedCols = presorted_keys;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4949ef2079..ca34552687 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -11630,7 +11630,7 @@ check_backtrace_functions(char **newval, void **extra, GucSource source)
 		else if ((*newval)[i] == ' ' ||
 				 (*newval)[i] == '\n' ||
 				 (*newval)[i] == '\t')
-			;	/* ignore these */
+			;					/* ignore these */
 		else
 			someval[j++] = (*newval)[i];	/* copy anything else */
 	}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c2bd38f39f..2c2efff0a6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -251,13 +251,13 @@ struct Tuplesortstate
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
 	int64		maxSpace;		/* maximum amount of space occupied among sort
-								   of groups, either in-memory or on-disk */
-	bool		maxSpaceOnDisk;	/* true when maxSpace is value for on-disk
-								   space, false when it's value for in-memory
-								   space */
-	TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */
-	MemoryContext maincontext;	/* memory context for tuple sort metadata
-								   that persist across multiple batches */
+								 * of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persist across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -1350,8 +1350,8 @@ tuplesort_end(Tuplesortstate *state)
 static void
 tuplesort_updatemax(Tuplesortstate *state)
 {
-	int64	spaceUsed;
-	bool	spaceUsedOnDisk;
+	int64		spaceUsed;
+	bool		spaceUsedOnDisk;
 
 	/*
 	 * Note: it might seem we should provide both memory and disk usage for a
@@ -1374,9 +1374,9 @@ tuplesort_updatemax(Tuplesortstate *state)
 	}
 
 	/*
-	 * Sort evicts data to the disk when it didn't manage to fit those data
-	 * to the main memory.  This is why we assume space used on the disk to
-	 * be more important for tracking resource usage than space used in memory.
+	 * Sort evicts data to the disk when it didn't manage to fit those data to
+	 * the main memory.  This is why we assume space used on the disk to be
+	 * more important for tracking resource usage than space used in memory.
 	 * Note that amount of space occupied by some tuple set on the disk might
 	 * be less than amount of space occupied by the same tuple set in the
 	 * memory due to more compact representation.
@@ -2775,7 +2775,7 @@ mergeruns(Tuplesortstate *state)
 	 */
 	state->memtupsize = numInputTapes;
 	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
-										numInputTapes * sizeof(SortTuple));
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
index 3113989272..e62c02a4f3 100644
--- a/src/include/executor/nodeIncrementalSort.h
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -25,4 +25,4 @@ extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, Paralle
 extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
 extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
 
-#endif   /* NODEINCREMENTALSORT_H */
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0934482123..c96f03e48d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1989,9 +1989,9 @@ typedef struct MaterialState
  */
 typedef struct PresortedKeyData
 {
-	FmgrInfo				flinfo;	/* comparison function info */
-	FunctionCallInfo	fcinfo; /* comparison function call info */
-	OffsetNumber			attno;	/* attribute number in tuple */
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
 } PresortedKeyData;
 
 /* ----------------
@@ -2024,12 +2024,12 @@ typedef struct SortState
 
 typedef struct IncrementalSortGroupInfo
 {
-	int64 groupCount;
-	long maxDiskSpaceUsed;
-	long totalDiskSpaceUsed;
-	long maxMemorySpaceUsed;
-	long totalMemorySpaceUsed;
-	List *sortMethods;
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
 } IncrementalSortGroupInfo;
 
 typedef struct IncrementalSortInfo
@@ -2044,8 +2044,8 @@ typedef struct IncrementalSortInfo
  */
 typedef struct SharedIncrementalSortInfo
 {
-	int							num_workers;
-	IncrementalSortInfo			sinfo[FLEXIBLE_ARRAY_MEMBER];
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
 } SharedIncrementalSortInfo;
 
 /* ----------------
@@ -2066,13 +2066,13 @@ typedef struct IncrementalSortState
 	bool		bounded;		/* is the result set bounded? */
 	int64		bound;			/* if bounded, how many tuples are needed */
 	bool		sort_Done;		/* sort completed yet? */
-	bool		finished;		/* fetching tuples from outer node
-								   is finished ? */
+	bool		finished;		/* fetching tuples from outer node is finished
+								 * ? */
 	int64		bound_Done;		/* value of bound we did the sort with */
 	IncrementalSortExecutionStatus execution_status;
-	int64			n_fullsort_remaining;
-	Tuplesortstate	   *fullsort_state; /* private state of tuplesort.c */
-	Tuplesortstate	   *prefixsort_state; /* private state of tuplesort.c */
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
 	/* the keys by which the input path is already sorted */
 	PresortedKeyData *presorted_keys;
 
@@ -2082,7 +2082,7 @@ typedef struct IncrementalSortState
 	TupleTableSlot *group_pivot;
 	TupleTableSlot *transfer_tuple;
 	bool		am_worker;		/* are we a worker? */
-	SharedIncrementalSortInfo *shared_info;	/* one entry per worker */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
 } IncrementalSortState;
 
 /* ---------------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index bfee4db721..34f18bd73a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,14 +103,14 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
 extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
-			   Cost input_total_cost, double tuples, int width,
-			   Cost comparison_cost, int sort_mem,
-			   double limit_tuples);
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
 extern void cost_incremental_sort(Path *path,
-		  PlannerInfo *root, List *pathkeys, int presorted_keys,
-		  Cost input_startup_cost, Cost input_total_cost,
-		  double input_tuples, int width, Cost comparison_cost, int sort_mem,
-		  double limit_tuples);
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 57ecbbb01c..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -185,11 +185,11 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  Path *subpath,
 												  PathTarget *target);
 extern SortPath *create_incremental_sort_path(PlannerInfo *root,
-				 RelOptInfo *rel,
-				 Path *subpath,
-				 List *pathkeys,
-				 int presorted_keys,
-				 double limit_tuples);
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index d778b884a9..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -191,7 +191,7 @@ typedef enum
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
 extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
-extern int pathkeys_common(List *keys1, List *keys2);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
-- 
2.20.1
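
To make the bound bookkeeping in the hunks above concrete: after each
sorted group the executor records progress as bound_Done = Min(bound,
bound_Done + nTuples), so the next per-group sort only needs (bound -
bound_Done) tuples. A minimal standalone sketch of that arithmetic (plain
C with a local Min(); the executor state is reduced to bare variables, so
this is an illustration rather than the actual node code):

#include <stdio.h>
#include <stdint.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
	int64_t		bound = 100;	/* LIMIT requested of the sort node */
	int64_t		bound_Done = 0; /* tuples accounted for by finished groups */
	int64_t		nTuples;		/* tuples sorted in the group just finished */

	/* first group: the full functional bound is still available */
	printf("group 1 bound: %lld\n", (long long) (bound - bound_Done));	/* 100 */

	nTuples = 30;
	bound_Done = Min(bound, bound_Done + nTuples);	/* 30 */

	/* second group: only a top-70 sort is needed now */
	printf("group 2 bound: %lld\n", (long long) (bound - bound_Done));	/* 70 */

	nTuples = 90;				/* a large group: progress clamps at the bound */
	bound_Done = Min(bound, bound_Done + nTuples);	/* 100, not 120 */
	printf("bound_Done: %lld\n", (long long) bound_Done);
	return 0;
}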

0005-no-n-after-left-parens-see-c9d297751959.patch (text/x-diff; charset=us-ascii)
From b33b70810d94950f4c3e20fb0fb01e103a25cf11 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 18:25:46 -0300
Subject: [PATCH 5/8] no \n after left parens, see c9d297751959

---
 src/backend/executor/nodeIncrementalSort.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 4f6b438e7b..bbb3f35640 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -135,8 +135,8 @@ preparePresortedCols(IncrementalSortState *node)
 		key = &node->presorted_keys[i];
 		key->attno = plannode->sort.sortColIdx[i];
 
-		equalityOp = get_equality_op_for_ordering_op(
-													 plannode->sort.sortOperators[i], NULL);
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
 		if (!OidIsValid(equalityOp))
 			elog(ERROR, "missing equality operator for ordering operator %u",
 				 plannode->sort.sortOperators[i]);
@@ -265,8 +265,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		 * Optimize the sort by assuming the prefix columns are all equal and
 		 * thus we only need to sort by any remaining columns.
 		 */
-		prefixsort_state = tuplesort_begin_heap(
-												tupDesc,
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
 												plannode->sort.numCols - presortedCols,
 												&(plannode->sort.sortColIdx[presortedCols]),
 												&(plannode->sort.sortOperators[presortedCols]),
@@ -576,8 +575,7 @@ ExecIncrementalSort(PlanState *pstate)
 			 * setup the full sort tuplesort to sort by all requested sort
 			 * columns.
 			 */
-			fullsort_state = tuplesort_begin_heap(
-												  tupDesc,
+			fullsort_state = tuplesort_begin_heap(tupDesc,
 												  plannode->sort.numCols,
 												  plannode->sort.sortColIdx,
 												  plannode->sort.sortOperators,
@@ -1011,10 +1009,12 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
 
 	/* make standalone slot to store previous tuple from outer node */
-	incrsortstate->group_pivot = MakeSingleTupleTableSlot(
-														  ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
-	incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
-															 ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
 
 	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
 
-- 
2.20.1
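
The rule 0005 above enforces (from commit c9d297751959) is that a function
call shouldn't put a newline immediately after its opening parenthesis;
the first argument stays on that line. A tiny self-contained illustration
of the two layouts (add3() is a made-up helper used only for this sketch;
both forms compile to the same thing, only the style differs):

static int
add3(int a, int b, int c)
{
	return a + b + c;
}

int
main(void)
{
	/* discouraged: newline right after the left paren */
	int			x = add3(
						 1, 2, 3);

	/* preferred: the first argument on the same line as the paren */
	int			y = add3(1,
						 2, 3);

	return x - y;				/* 0: only the layout differs */
}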

0006-use-castNode-instead-of-Assert-IsA-plus-cast.patch (text/x-diff; charset=us-ascii)
From be20eb23c85a5e53d7406a21a1e0ca5c5006c2ea Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 18:29:52 -0300
Subject: [PATCH 6/8] use castNode() instead of Assert(IsA) plus cast

---
 src/backend/executor/nodeIncrementalSort.c | 26 ++++++++--------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index bbb3f35640..be1afbb169 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -115,18 +115,14 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 static void
 preparePresortedCols(IncrementalSortState *node)
 {
-	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
-	int			presortedCols,
-				i;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
 
-	Assert(IsA(plannode, IncrementalSort));
-	presortedCols = plannode->presortedCols;
-
-	node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
-													   sizeof(PresortedKeyData));
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
 
 	/* Pre-cache comparison functions for each pre-sorted key. */
-	for (i = 0; i < presortedCols; i++)
+	for (int i = 0; i < plannode->presortedCols; i++)
 	{
 		Oid			equalityOp,
 					equalityFunc;
@@ -162,17 +158,13 @@ preparePresortedCols(IncrementalSortState *node)
  *
  * We do this by comparing its first 'presortedCols' column values to
  * the pivot tuple of the current group.
- *
  */
 static bool
 isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
 {
-	int			presortedCols,
-				i;
+	int			presortedCols;
 
-	Assert(IsA(node->ss.ps.plan, IncrementalSort));
-
-	presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
 
 	/*
 	 * That the input is sorted by keys * (0, ... n) implies that the tail
@@ -180,7 +172,7 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
 	 * from the last pre-sorted column to optimize for early detection of
 	 * inequality and minimizing the number of function calls..
 	 */
-	for (i = presortedCols - 1; i >= 0; i--)
+	for (int i = presortedCols - 1; i >= 0; i--)
 	{
 		Datum		datumA,
 					datumB,
@@ -250,7 +242,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	bool		firstTuple = true;
 	TupleDesc	tupDesc;
 	PlanState  *outerNode;
-	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
 
 	dir = node->ss.ps.state->es_direction;
 	outerNode = outerPlanState(node);
-- 
2.20.1
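
0006 above replaces the two-step Assert(IsA(...)) plus cast with
castNode(), which checks the node tag and performs the cast in a single
expression. Here's a self-contained analogue with toy stand-ins for the
Node machinery (the real definitions in PostgreSQL's nodes/nodes.h are
more elaborate; this sketch only shows the shape of the idiom):

#include <assert.h>
#include <stdio.h>

/* Toy stand-ins for PostgreSQL's tagged Node hierarchy. */
typedef enum NodeTag
{
	T_Sort,
	T_IncrementalSort
} NodeTag;

typedef struct Node
{
	NodeTag		type;
} Node;

typedef struct IncrementalSort
{
	Node		node;			/* first field, so a Node * can point here */
	int			presortedCols;
} IncrementalSort;

#define IsA(nodeptr, t)		(((const Node *) (nodeptr))->type == T_##t)
/* check the tag, then cast -- one expression, like PG's castNode() */
#define castNode(t, nodeptr)	(assert(IsA(nodeptr, t)), (t *) (nodeptr))

int
main(void)
{
	IncrementalSort is = {{T_IncrementalSort}, 2};
	Node	   *plan = &is.node;

	/* old pattern: assert, then separately cast */
	assert(IsA(plan, IncrementalSort));
	IncrementalSort *a = (IncrementalSort *) plan;

	/* new pattern: one checked cast */
	IncrementalSort *b = castNode(IncrementalSort, plan);

	printf("%d %d\n", a->presortedCols, b->presortedCols);	/* "2 2" */
	return 0;
}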

0007-Test-trivial-condition-before-more-complex-one.patch (text/x-diff; charset=us-ascii)
From 0404936af5b9f2cc420863f505dae6f9085440ad Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 18:32:53 -0300
Subject: [PATCH 7/8] Test trivial condition before more complex one

---
 src/backend/executor/nodeIncrementalSort.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index be1afbb169..909d2df53f 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -499,9 +499,9 @@ ExecIncrementalSort(PlanState *pstate)
 		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
 			fullsort_state : node->prefixsort_state;
 		slot = node->ss.ps.ps_ResultTupleSlot;
-		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
-								   false, slot, NULL) || node->finished)
-
+		if (node->finished ||
+			tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL))
 			/*
 			 * TODO: there isn't a good test case for the node->finished case
 			 * directly, but lots of other stuff fails if it's not there. If
-- 
2.20.1
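
0007 above relies on C's short-circuit evaluation of ||: with the cheap
flag test first, the potentially expensive (and slot-filling) tuplesort
fetch is skipped entirely once the node is finished. A tiny self-contained
illustration, where expensive_fetch() is a stand-in for
tuplesort_gettupleslot(), not the real call:

#include <stdbool.h>
#include <stdio.h>

static int	fetch_calls = 0;

/* stand-in for tuplesort_gettupleslot(): costly and side-effecting */
static bool
expensive_fetch(void)
{
	fetch_calls++;
	return false;				/* pretend there are no more tuples */
}

int
main(void)
{
	bool		finished = true;

	if (expensive_fetch() || finished)
		;						/* original order: fetch runs even when finished */

	if (finished || expensive_fetch())
		;						/* reordered: || short-circuits, fetch skipped */

	printf("fetch calls: %d\n", fetch_calls);	/* 1, not 2 */
	return 0;
}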

0008-reverse-arguments-.-isn-t-the-other-order-a-bug.patch (text/x-diff; charset=us-ascii)
From 04c45e4047ae10cc5706fadd9075d2e7c1ff1411 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Mar 2020 18:33:02 -0300
Subject: [PATCH 8/8] reverse arguments .. isn't the other order a bug?

---
 src/backend/executor/nodeIncrementalSort.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 909d2df53f..bb88fca207 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -653,7 +653,7 @@ ExecIncrementalSort(PlanState *pstate)
 										  &node->incsort_info.fullsortGroupInfo,
 										  fullsort_state);
 
-				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple) \n");
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
 				node->execution_status = INCSORT_READFULLSORT;
 				break;
 			}
@@ -713,7 +713,8 @@ ExecIncrementalSort(PlanState *pstate)
 						 * configuring sorting bound.
 						 */
 						SO2_printf("Changing bound_Done from %ld to %ld\n",
-								   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
 						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 					}
 
@@ -721,7 +722,8 @@ ExecIncrementalSort(PlanState *pstate)
 					 * Once we find changed prefix keys we can complete the
 					 * sort and begin reading out the sorted tuples.
 					 */
-					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
 					tuplesort_performsort(fullsort_state);
 
 					if (pstate->instrument != NULL)
@@ -882,7 +884,8 @@ ExecIncrementalSort(PlanState *pstate)
 			 * in configuring sorting bound.
 			 */
 			SO2_printf("Changing bound_Done from %ld to %ld\n",
-					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 		}
 	}
-- 
2.20.1

#196Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#194)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Thanks for working on this. I have some minor comments.

In 0005:

+ /* Restore the input path (we might have addes Sort on top). */

=> added? There are at least two more instances of the same typo.

+ /* also ignore already sorted paths */

=> You say that in a couple places, but I don't think "also" makes sense since
there's nothing preceding it ?

In 0004:

+ * end up resorting the entire data set. So, unless we can push

=> re-sorting

+ * Unlike generate_gather_paths, this does not look just as pathkeys of the

=> look just AT ?

+ /* now we know is_sorted == false */

=> I would just spell that "Assert", as I think you already do elsewhere.

+ /* continue */

=> Please consider saying "fall through", since "continue" means exactly the
opposite.
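
For example (a standalone sketch, nothing from the patch): in a switch
nested inside a loop, "continue" is a live statement that resumes the
loop, while a bare comment is the conventional marker for falling
through to the next case:

#include <stdio.h>

int
main(void)
{
	for (int i = 0; i < 3; i++)
	{
		switch (i)
		{
			case 0:
				printf("case 0\n");
				/* fall through */
			case 1:
				printf("case 1 (i=%d)\n", i);
				break;
			case 2:
				printf("case 2\n");
				continue;	/* resumes the loop immediately... */
		}
		printf("after switch (i=%d)\n", i);	/* ...skipping this line */
	}
	return 0;
}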

+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
...
+			/* finally, consider incremental sort */
...
+				/* Also consider incremental sort. */

=> I think having two comments here is more confusing than useful - one is adequate.

In 0002:

+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
...
+ * make_incrementalsort --- basic routine to build a IncrementalSort plan node

=> AN incremental

+ * Initial size of memtuples array.  We're trying to select this size so that
+ * array don't exceed ALLOCSET_SEPARATE_THRESHOLD and overhead of allocation
+ * be possible less.  However, we don't cosider array sizes less than 1024

Four typos (?)
that array DOESN'T
and THE overhead
CONSIDER
I'm not sure, but "be possible less" should maybe say "possibly be less" ?

+ bool maxSpaceOnDisk; /* true when maxSpace is value for on-disk

I suggest calling it IsMaxSpaceDisk

+	MemoryContext maincontext;	/* memory context for tuple sort metadata
+					   that persist across multiple batches */

persists

+ *	a new sort.  It allows evade recreation of tuple sort (and save resources)
+ *	when sorting multiple small batches.

allows to avoid? Or allows avoiding?

+ *	 When performing sorting by multiple keys input dataset could be already
+ *	 presorted by some prefix of these keys.  We call them "presorted keys".

"already presorted" sounds redundant

+	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
+	int64		prefixsort_group_count;	/* number of groups with equal presorted keys */

I guess these should have different comments

--
Justin

#197James Coleman
jtc331@gmail.com
In reply to: Alvaro Herrera (#195)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I gave this a very quick look; I don't claim to understand it or
anything, but I thought these trivial cleanups worthwhile. The only
non-cosmetic thing is changing order of arguments to the SOn_printf()
calls in 0008; I think they are contrary to what the comment says.

Yes, I think you're correct (re: 0008).

They all look generally good to me, and are included in the attached
patch series.

I don't propose to commit 0003 of course, since it's not our policy;
that's just to allow running pgindent sanely, which gives you 0004
(though my local pgindent has an unrelated fix). And after that you
notice the issue that 0005 fixes.

Is there a page on how you're supposed to run pgindent/when stuff like
this does get added/etc.? It's all a big mystery to me right now.

Also, I noticed some of the pgindent changes aren't for changes in
this patch series; I have that as a separate patch, but not attached
because I see that running pgindent locally generates a massive patch,
so I'm assuming we just ignore those for now?

I did notice that show_incremental_sort_group_info() seems to be doing
things in a hard way, or something. I got there because it throws this
warning:

/pgsql/source/master/src/backend/commands/explain.c: In function 'show_incremental_sort_group_info':
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing argument 2 of 'lappend' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
In file included from /pgsql/source/master/src/include/access/xact.h:20,
from /pgsql/source/master/src/backend/commands/explain.c:16:
/pgsql/source/master/src/include/nodes/pg_list.h:509:14: note: expected 'void *' but argument is of type 'const char *'
extern List *lappend(List *list, void *datum);
^~~~~~~
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing 'const char *' to parameter of type 'void *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
/pgsql/source/master/src/include/nodes/pg_list.h:509:40: note: passing argument to parameter 'datum' here
extern List *lappend(List *list, void *datum);
^
1 warning generated.

(Eh, it's funny that GCC reports two warnings about the same line, and
then says there's one warning.)

I had seen this before I sent the patch, but then it seemed like it
disappeared, so I didn't come back to it; maybe I just missed it in my
buffer.

I do see it now, and moving the declarations into each relevant block
(rather than trying to share them) seems to fix it. I think that's
correct anyway, since before they were technically being assigned
more than once, which seems wrong for const.

I have this change locally and will include it in my next patch version.
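
Concretely, the shape of that change is something like the following
sketch (made-up condition and variable names, not the actual explain.c
code): each const pointer is declared in the block where it's used, so
it's initialized exactly once:

	if (es->format == EXPLAIN_FORMAT_TEXT)
	{
		const char *sortMethodName = tuplesort_method_name(method);

		appendStringInfoString(es->str, sortMethodName);
	}
	else
	{
		const char *spaceTypeName = tuplesort_space_type_name(spaceType);

		ExplainPropertyText("Sort Space Type", spaceTypeName, es);
	}

rather than declaring one shared "const char *" up top and assigning it
in both branches.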

I suppose you could silence this by adding pstrdup(), and then use
list_free_deep (you have to put the sortMethodName declaration in the
inner scope for that, but seems fine). Or maybe there's a clever way
around it.
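
Something like this sketch, say (borrowing the loop shape from the
patch; only the pstrdup()/list_free_deep() calls are new):

	List	   *methodNames = NIL;
	ListCell   *methodCell;

	foreach(methodCell, groupInfo->sortMethods)
	{
		const char *sortMethodName = tuplesort_method_name(lfirst_int(methodCell));

		/* pstrdup() hands lappend() a palloc'd, non-const copy */
		methodNames = lappend(methodNames, pstrdup(sortMethodName));
	}
	ExplainPropertyList("Sort Methods Used", methodNames, es);

	/* frees the list cells and each pstrdup'd string */
	list_free_deep(methodNames);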

But I hesitate to send a patch for that, because the whole function is
written to handle text and the other output formats completely
separately -- yet looking at show_modifytable_info(), for example, it
seems you can call ExplainOpenGroup, ExplainPropertyText,
ExplainPropertyList, etc. in all explain output modes, and those
routines will take care of emitting the data in the correct format,
without the show_incremental_sort_group_info function having to
duplicate everything.

I'm not sure how that would work: those functions (for
EXPLAIN_FORMAT_TEXT) all add newlines, and this code is intentionally
trying to avoid too many lines.

I'm open to suggestions though.

HTH. I would really like to get this patch done for pg13.

As would I!

James

Attachments:

v35-0001-Consider-low-startup-cost-when-adding-partial-pa.patchapplication/octet-stream; name=v35-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 5252de9888e9e676d8dcb8efa840199633b85d9d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v35 1/5] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost
plan ends up being chosen: a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.20.1 (Apple Git-117)

v35-0004-A-couple-more-places-for-incremental-sort.patchapplication/octet-stream; name=v35-0004-A-couple-more-places-for-incremental-sort.patchDownload
From eeb1282a0a1d45626be68f2fe374d76caceae486 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v35 4/5] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 216 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c2b76d7675..02958e36c7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5070,6 +5070,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This probably duplicates the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6570,12 +6631,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6606,6 +6673,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6875,6 +6992,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* add incremental sort */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7067,10 +7238,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7096,6 +7268,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+
+		total_groups = path->rows * path->parallel_workers;
+
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7197,7 +7409,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.20.1 (Apple Git-117)

v35-0002-Implement-incremental-sort.patchapplication/octet-stream; name=v35-0002-Implement-incremental-sort.patchDownload
From 4633abf53684cf02355e88c9c96fe0959c22d865 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v35 2/5] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1179 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   73 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   77 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3765 insertions(+), 125 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb..64ea00f462 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..e73038b0cd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
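+/*
+ * Show the group count, the sort methods used, and (when available) the
+ * average and maximum memory/disk space used for one kind of incremental
+ * sort group (full-sort or presorted), in both text and structured
+ * EXPLAIN output formats.
+ */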
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, sortMethodName);
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense, since the same sort instrument can have been used
+			 * multiple times, so its most recent use still being in
+			 * progress doesn't seem relevant. Instead I'm now checking
+			 * to see if the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is a Sort node, notify it that it can use bounded sort.
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bb88fca207
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1179 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them back together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
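+/*
+ * Accumulate tuplesort statistics for one fully sorted group into the
+ * given group info struct, and copy the node's stats into shared memory
+ * when running in a parallel worker.
+ */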
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change. Therefore we do our comparison
+	 * starting from the last pre-sorted column, to optimize for early
+	 * detection of inequality and minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all of the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and have moved all of those
+		 * tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner
+		 * node read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem, we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix
+ * keys). When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE, we
+ * start looking for a new group as soon as we've met our bound to avoid
+ * fetching more tuples than we absolutely have to.
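+ *
+ * As an illustration (made-up numbers, not measured): with a bound of 10,
+ * minGroupSize becomes Min(DEFAULT_MIN_GROUP_SIZE, 10) = 10, so we begin
+ * prefix key comparisons after the tenth tuple instead of the thirty-second.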
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group, we transition to
+ * presorted prefix key mode.
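+ *
+ * With the default DEFAULT_MIN_GROUP_SIZE of 32 this cutoff is 64 tuples:
+ * once a group exceeds that without a prefix key change, we assume we've hit
+ * a large group and switch modes.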
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that outer subtree returns tuple presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
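+ *		As a sketch of when this node applies (hypothetical schema): given an
+ *		index on (a) and the query
+ *
+ *			SELECT * FROM tbl ORDER BY a, b;
+ *
+ *		the index scan returns tuples already sorted by a, so this node only
+ *		has to sort each group of equal a values by b.
+ *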
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (node->finished ||
+			tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL))
+			/*
+			 * TODO: there isn't a good test case for the node->finished case
+			 * directly, but lots of other stuff fails if it's not there. If
+			 * the outer node will fail when trying to fetch too many tuples,
+			 * then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out,
+			 * and we've returned those tuples to the caller, but if tuples
+			 * remain in that tuplesort (i.e., n_fullsort_remaining > 0) at
+			 * this point we need to do that again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If the node is bounded, calculate the number of tuples remaining
+		 * and configure both bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sort "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * not to need any more than currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to the caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the current
+	 * batch of tuples in the tuplesort state rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_end(node->fullsort_state);
+	node->fullsort_state = NULL;
+	if (node->prefixsort_state != NULL)
+		tuplesort_end(node->prefixsort_state);
+	node->prefixsort_state = NULL;
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Incremental sort doesn't
+	 * support randomAccess, so we always have to re-sort rather than
+	 * rewinding and rescanning the sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..f73d0782f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
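+ *
+ * As a worked illustration (made-up numbers): with 10000 input tuples in an
+ * estimated 100 groups, each group is assumed to contain 100 tuples, and
+ * cost_tuplesort() is invoked for a pessimistic 1.5 * 100 = 150 tuples per
+ * group.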
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted keys
+	 * are equal.  Incremental sort is sensitive to the distribution of tuples
+	 * across groups, where we're relying on quite rough assumptions.  Thus,
+	 * we're pessimistic about incremental sort performance and increase the
+	 * estimated average group size by half (i.e., multiply by 1.5).
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
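+ *
+ *    For example, if keys1 = (a, b) and keys2 = (a, b, c), *n_common is set
+ *    to 2 and true is returned; with the arguments swapped, *n_common is
+ *    still 2 but false is returned, since keys1 is then not fully contained
+ *    in keys2.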
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none. Any
+	 * leading common pathkeys could be useful for ordering because we can use
+	 * incremental sort to provide the rest of the requested ordering.
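+	 *
+	 * For example, if query_pathkeys is (a, b, c) and this path is sorted by
+	 * (a) only, we return 1: an incremental sort can supply (b, c) on top of
+	 * the presorted leading key.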
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..026a60b946 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..55fe2a935c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4868,7 +4868,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..11e6fce9d1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..4949ef2079 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..2c2efff0a6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the
+ * allocation overhead is as small as possible.  However, we don't consider
+ * array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sorts of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data that is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are released by tuplesort_reset().
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk more important for
+	 * tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less than
+	 * the amount of space occupied by the same tuples in memory, due to the
+	 * more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating the tuple sort (and thus saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
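
To show the intended use of tuplesort_reset(), here is a sketch of the
per-group loop a caller such as an incremental sort node might run.  The
tuplesort is set up once and its metadata survives across batches;
load_next_group() and process() are hypothetical placeholders, and the
tuplesort_begin_heap() arguments and the output slot are assumed to be
prepared by the caller:

	Tuplesortstate *ts;

	ts = tuplesort_begin_heap(tupDesc, nkeys, attNums, sortOperators,
							  sortCollations, nullsFirstFlags,
							  work_mem, NULL, false);

	while (load_next_group(ts))	/* feeds tuples via tuplesort_puttupleslot() */
	{
		tuplesort_performsort(ts);
		while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
			process(slot);

		/* free the tuples, but keep the metadata for the next batch */
		tuplesort_reset(ts);
	}
	tuplesort_end(ts);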
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
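
These executor entry points get wired into the usual dispatch switches
elsewhere in the patch; as a sketch (the corresponding execProcnode.c hunk is
not shown in this excerpt), ExecInitNode() gains a case along these lines:

		case T_IncrementalSort:
			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
														   estate, eflags);
			break;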
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..c96f03e48d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset might already be
+ *	 presorted by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2022,69 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* is fetching tuples from the outer node
+								 * finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
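
As an illustration of how PresortedKeyData is meant to be consumed, here is a
sketch of a group-boundary check comparing the group pivot tuple against an
incoming tuple on one presorted key.  The function name is hypothetical, the
loop over all presorted keys is elided, and NULL handling is only hinted at:

static bool
key_matches_pivot(IncrementalSortState *node, int i, TupleTableSlot *tuple)
{
	PresortedKeyData *key = &node->presorted_keys[i];
	FunctionCallInfo fcinfo = key->fcinfo;
	Datum		result;

	fcinfo->args[0].value = slot_getattr(node->group_pivot, key->attno,
										 &fcinfo->args[0].isnull);
	fcinfo->args[1].value = slot_getattr(tuple, key->attno,
										 &fcinfo->args[1].isnull);

	/*
	 * A strict equality function must not be called on NULL inputs, so a
	 * real implementation would check the isnull flags first.
	 */
	fcinfo->isnull = false;
	result = FunctionCallInvoke(fcinfo);

	return !fcinfo->isnull && DatumGetBool(result);
}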
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..34f18bd73a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
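
The new tuplesort_used_bound() probe lets a caller learn, after performing the
sort, whether a bound requested via tuplesort_set_bound() actually caused
tuples to be discarded.  A sketch of the intended check, with the surrounding
variables assumed:

	tuplesort_set_bound(ts, tuples_needed);
	/* ... feed tuples, then ... */
	tuplesort_performsort(ts);

	if (tuplesort_used_bound(ts))
	{
		/*
		 * The sort switched to a bounded top-N heapsort and discarded
		 * tuples beyond the bound, so the full input cannot be re-read
		 * from this sort state.
		 */
	}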
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..7892b111d7
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 26kB (avg), 30kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 26,       +
+             "Maximum Sort Space Used": 30,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.20.1 (Apple Git-117)

v35-0003-Consider-incremental-sort-paths-in-additional-pl.patch (application/octet-stream)
From 7fe5a5a39dd53b4cbf5bdb3decc7263b75594776 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v35 3/5] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 222 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 351 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..e0bb71dd51 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,224 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just at the pathkeys of
+ * the input paths (aiming to preserve the ordering). It also considers
+ * orderings that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide them.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3117,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 55fe2a935c..c2b76d7675 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6424,7 +6424,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6483,6 +6485,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6807,7 +6883,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6842,6 +6920,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7223,7 +7351,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.20.1 (Apple Git-117)

#198James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#194)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

> Now, a couple comments about parts 0001 - 0003 of the patch ...
>
> 1) I see a bunch of failures in the regression test, due to minor
> differences in the explain output. All the differences are about minor
> changes in memory usage, like this:
>
> -               "Sort Space Used": 30,                             +
> +               "Sort Space Used": 29,                             +
>
> I'm not sure if it happens on my machine only, but maybe the test is not
> entirely stable.

make check passes on multiple machines for me; what arch/distro are you using?

Is there a better way to test these? I would prefer these code paths
have test coverage, but the standard SQL tests don't leave a good way
to handle stuff like this.

Is TAP the only alternative, and do you think it'd be worth considering?
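
One idea, just a sketch and not something in the attached patches: run
EXPLAIN ANALYZE through a small plpgsql wrapper that masks the unstable
memory/disk figures, so the output no longer depends on allocator
behavior. (The function name and the regex here are invented for
illustration.)

create or replace function explain_analyze_masked(query text)
returns setof text language plpgsql as
$$
declare
    ln text;
begin
    for ln in
        execute 'explain (analyze, costs off, summary off, timing off) ' || query
    loop
        -- Mask text-format figures like "Memory: 30kB"; JSON output would
        -- need a similar replacement for "Sort Space Used": N.
        return next regexp_replace(ln, '\d+kB', 'NkB', 'g');
    end loop;
end;
$$;

select * from explain_analyze_masked(
    'select * from (select * from t order by a) s order by a, b limit 55');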

> 2) I think this bit in ExecReScanIncrementalSort is wrong:
>
> node->sort_Done = false;
> tuplesort_end(node->fullsort_state);
> node->prefixsort_state = NULL;
> tuplesort_end(node->fullsort_state);
> node->prefixsort_state = NULL;
> node->bound_Done = 0;
>
> Notice both places reset fullsort_state and set prefixsort_state to
> NULL. Another thing is that I'm not sure it's fine to pass NULL to
> tuplesort_end (my guess is tuplesort_free will fail when it gets NULL).

Fixed.
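
For the archives, the corrected reset ends up looking roughly like the
sketch below (the authoritative version is in the attached v36-0002; the
NULL checks also sidestep the concern about passing NULL to
tuplesort_end):

    node->sort_Done = false;

    /* End each tuplesort at most once, and never pass NULL. */
    if (node->fullsort_state != NULL)
    {
        tuplesort_end(node->fullsort_state);
        node->fullsort_state = NULL;
    }
    if (node->prefixsort_state != NULL)
    {
        tuplesort_end(node->prefixsort_state);
        node->prefixsort_state = NULL;
    }

    node->bound_Done = 0;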

James

Attachments:

v36-0001-Consider-low-startup-cost-when-adding-partial-pa.patch (application/octet-stream)
From 5252de9888e9e676d8dcb8efa840199633b85d9d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v36 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, even though a limit applied on top of it
would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.20.1 (Apple Git-117)

v36-0003-Consider-incremental-sort-paths-in-additional-pl.patch (application/octet-stream)
From e8374b2e93d6278ebc624a25f3662876116c6296 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v36 3/4] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 222 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 351 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..e0bb71dd51 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,224 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just at the pathkeys of
+ * the input paths (aiming to preserve the ordering). It also considers
+ * orderings that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide them.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3117,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 55fe2a935c..c2b76d7675 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6424,7 +6424,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6483,6 +6485,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6807,7 +6883,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6842,6 +6920,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added a Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7223,7 +7351,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.20.1 (Apple Git-117)

v36-0002-Implement-incremental-sort.patch (application/octet-stream)
From 71d6780e0adc0963724036c308ed7a0df4204115 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v36 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting only
    on the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1189 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   73 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   77 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3775 insertions(+), 125 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb..64ea00f462 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..e73038b0cd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
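+/*
+ * Display the per-group statistics gathered for one of an incremental sort
+ * node's sort modes: the number of groups, the sort methods used, and the
+ * average/maximum memory and disk space used across those groups.
+ */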
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, sortMethodName);
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense, since the same sort instrument can have been used
+			 * multiple times, so its last use still being in progress
+			 * doesn't seem relevant. Instead I'm now checking whether the
+			 * group count for each group info is 0. If both are 0, then we
+			 * exclude the worker since it didn't contribute anything
+			 * meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign
+		 * this, it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1eba11bb0c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1189 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them back together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
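+ *
+ *		As a hypothetical illustration, given an index on (x) alone, a query
+ *		such as
+ *
+ *			SELECT * FROM tbl ORDER BY x, y LIMIT 10;
+ *
+ *		can read rows already ordered by x from the index and let incremental
+ *		sort order each x-group by y, typically stopping after a few small
+ *		sorts instead of sorting the entire table.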
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
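+/*
+ * Accumulate EXPLAIN ANALYZE statistics for a sorted group: bump the group
+ * count, fold the group's memory or disk usage into the running totals and
+ * maxima, and remember which sort methods were used.
+ */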
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for presorted_keys comparison.
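+ *
+ * For each of the leading presortedCols sort keys, look up the equality
+ * function matching the ordering operator and pre-build a reusable
+ * FunctionCallInfo so that isCurrentGroup() can compare tuples cheaply.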
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Look up the equality function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
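+ *
+ * For example, with presortedCols = 2, tuples (1, 5, 42) and (1, 5, 7)
+ * belong to the same group, while (1, 5, 42) and (1, 6, 42) do not.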
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0, ... n), the tail keys are more
+	 * likely to change. Therefore we do our comparison starting from the last
+	 * pre-sorted column to optimize for early detection of inequality and
+	 * minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the tuples we've already fetched are all part of a
+ * single prefix key group, we also have to handle the possibility that there
+ * is at least one different prefix key group before the large prefix key
+ * group.
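+ *
+ * For example, if the full sort tuplesort still holds tuples with prefix
+ * values (7, 8, 8, 8, ...), the first call transfers and sorts the group of
+ * 7s; once those have been read out, a later call transfers the 8s, and only
+ * then do we continue loading that group directly from the outer node.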
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've transferred out all
+				 * of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the caller
+		 * read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+				   node->bound_Done,
+				   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
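+ * With the default minimum group size of 32, this cutoff is 64 tuples.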
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
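+ *
+ *		The execution_status field tracks the current mode: we move from
+ *		INCSORT_LOADFULLSORT to INCSORT_READFULLSORT for ordinary groups, or
+ *		through INCSORT_LOADPREFIXSORT to INCSORT_READPREFIXSORT once a large
+ *		single-prefix group is detected, and return to INCSORT_LOADFULLSORT
+ *		when a batch has been fully read out.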
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (node->finished ||
+			tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL))
+			/*
+			 * TODO: there isn't a good test case for the node->finished case
+			 * directly, but lots of other stuff fails if it's not there. If
+			 * the outer node will fail when trying to fetch too many tuples,
+			 * then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups
+			 * out, and we've returned those tuples to the caller, but if
+			 * tuples remain in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If bounded, calculate the number of tuples remaining to return and
+		 * configure both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we detect the end of a prefix key group only by reading
+		 * the first tuple of the following group, on subsequent groups we
+		 * have to carry over that extra tuple and add it to the new group's
+		 * sort here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" sort "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * not to need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to the caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current prefix key group in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan of the sorted
+	 * output, so we always forget previous results: release both
+	 * tuplesorts, re-read the subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..f73d0782f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ * 	Determines the cost of sorting a relation, including the cost of
+ *	reading the input data, returning the results via *startup_cost and
+ *	*run_cost.
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
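+ *
+ * For example (hypothetical numbers), with input_tuples = 10000 and an
+ * estimated input_groups = 100, each per-group tuplesort is costed for
+ * 1.5 * 10000 / 100 = 150 tuples; startup covers only the first group plus
+ * its share of the input cost, while the remaining 99 groups are charged
+ * to run cost.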
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and increase its average
+	 * group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
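+ *
+ *    For example, if keys1 is (a, b, c) and keys2 is (a, b), the common
+ *    prefix has length 2, so *n_common is set to 2 and false is returned
+ *    since keys1 is not contained in keys2.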
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none.  Any
+	 * leading common pathkeys could be useful for ordering, because we can
+	 * use incremental sort to provide the remaining keys.
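+	 *
+	 * For instance (illustration only): if the query needs ORDER BY a, b, c
+	 * and this path is sorted by (a, b), we return 2: an incremental sort
+	 * only has to sort by c within each group of equal (a, b) values.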
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..026a60b946 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan
+ *	  instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..55fe2a935c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4868,7 +4868,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The new paths we need consider are an explicit full sort on the
+ * cheapest-total existing path, plus an incremental sort on any path
+ * already sorted by a prefix of the required pathkeys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
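+				/*
+				 * Note that, unlike the full sort above, this is tried for
+				 * every input path with a useful presorted prefix, not only
+				 * the cheapest one: a path that is more expensive overall
+				 * may still win once only the remaining suffix keys have to
+				 * be sorted within each group.
+				 */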
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..11e6fce9d1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..4949ef2079 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..2c2efff0a6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the
+ * allocation overhead is as small as possible.  However, we don't consider
+ * array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
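+
+/*
+ * For illustration only (assuming the usual 64-bit values, which this patch
+ * does not guarantee): with ALLOCSET_SEPARATE_THRESHOLD = 8192 bytes and
+ * sizeof(SortTuple) = 24 bytes, the threshold term is 8192 / 24 + 1 = 342,
+ * so INITIAL_MEMTUPSIZE evaluates to Max(1024, 342) = 1024.
+ */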
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sorts of groups, either in-memory or
+								 * on-disk */
+	bool		maxSpaceOnDisk; /* true when maxSpace is the amount of
+								 * on-disk space consumed, false when it's
+								 * the amount of in-memory space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuplesort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		spaceUsedOnDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		spaceUsedOnDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't fit in main memory.  This
+	 * is why we consider space used on disk more important for tracking
+	 * resource usage than space used in memory.  Note that the amount of
+	 * space occupied by a set of tuples on disk might be less than the
+	 * amount occupied by the same tuples in memory, thanks to the more
+	 * compact on-disk representation.
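+	 *
+	 * For example (made-up numbers): if one batch sorted within 40MB of
+	 * memory and a later batch spilled to disk using 25MB, we keep the 25MB
+	 * disk figure, because any spill to disk is the more interesting fact
+	 * for resource tracking, regardless of the byte counts.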
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->maxSpaceOnDisk = spaceUsedOnDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	thus save resources) when sorting multiple small batches.
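+ *
+ *	A sketch of the intended call pattern (illustration only, not code from
+ *	this patch):
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch of tuples:
+ *			feed the batch with tuplesort_puttupleslot(state, slot);
+ *			tuplesort_performsort(state);
+ *			drain the sorted batch with tuplesort_gettupleslot(state, ...);
+ *			tuplesort_reset(state);
+ *		tuplesort_end(state);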
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->maxSpaceOnDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..c96f03e48d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,20 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 by some prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2022,69 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* are we done fetching tuples from the
+								 * outer node? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..34f18bd73a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..7892b111d7
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we solve
+-- this by inserting extra rows, or by adding a GUC that would somehow force
+-- the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 26kB (avg), 30kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 26,       +
+             "Maximum Sort Space Used": 30,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.20.1 (Apple Git-117)
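
A note on the limit values in the new incremental_sort test above: bracketing
the suspected thresholds (31/32/33 around the first, 65/66 around the second)
is meant to catch off-by-one bugs in the executor's switch between full-sort
and presorted modes, which, judging from those values, happens around
32-tuple batches. That reading is an inference from the test, not something
the file states outright. The split between the modes can be observed
directly in the analyze output, e.g. via the "Presorted Groups" block shown
in the JSON earlier:

explain (analyze, costs off, summary off, timing off, format json)
select * from (select * from t order by a) s order by a, b limit 33;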

v36-0004-A-couple-more-places-for-incremental-sort.patchapplication/octet-stream; name=v36-0004-A-couple-more-places-for-incremental-sort.patchDownload
From beaabee4d11e9aeda8e1bdd99c24340f3797bb32 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v36 4/4] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 216 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c2b76d7675..02958e36c7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5070,6 +5070,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6570,12 +6631,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6606,6 +6673,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have addes Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6875,6 +6992,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* also ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* add incremental sort */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7067,10 +7238,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7096,6 +7268,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7197,7 +7409,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.20.1 (Apple Git-117)

#199Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#198)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-               "Sort Space Used": 30,                             +
+               "Sort Space Used": 29,                             +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.

make check passes on multiple machines for me; what arch/distro are you using?

I think there's exactly zero chance of such output being stable across
different platforms, particularly 32-vs-64-bit. You'll need to either
drop that test or find some way to mask the variability.

Is there a better way to test these? I would prefer these code paths
have test coverage, but the standard SQL tests don't leave a good way
to handle stuff like this.

In some places we use plpgsql code to filter the EXPLAIN output.

regards, tom lane
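
A minimal sketch of that plpgsql filtering approach; the function name and
the regexp below are illustrative, not taken from any existing test file:

create function explain_analyze_masked(query text) returns setof text
language plpgsql as
$$
declare
    ln text;
begin
    for ln in execute
        'explain (analyze, costs off, summary off, timing off) ' || query
    loop
        -- Mask the platform-dependent byte counts, e.g. "Memory: 30kB".
        ln := regexp_replace(ln, 'Memory: \d+kB', 'Memory: NNkB', 'g');
        return next ln;
    end loop;
end;
$$;

select * from explain_analyze_masked(
  'select * from (select * from t order by a) s order by a, b limit 55');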

#200Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#199)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

On 2020-03-13 13:36:44 -0400, Tom Lane wrote:

James Coleman <jtc331@gmail.com> writes:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-               "Sort Space Used": 30,                             +
+               "Sort Space Used": 29,                             +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.

make check passes on multiple machines for me; what arch/distro are you using?

I think there's exactly zero chance of such output being stable across
different platforms, particularly 32-vs-64-bit. You'll need to either
drop that test or find some way to mask the variability.

+1

Is there a better way to test these? I would prefer these code paths
have test coverage, but the standard SQL tests don't leave a good way
to handle stuff like this.

In some places we use plpgsql code to filter the EXPLAIN output.

I still think we should just go for a REPRODUCIBLE, TESTING, REGRESS or
similar EXPLAIN option, instead of playing whack-a-mole. Given the amount
of discussion, the reduced test coverage, the increased test complexity,
and the reduced quality of explain output for humans, the cost of such an
option is by now well worth paying.

Greetings,

Andres Freund
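
Usage of such an option might look like this; the syntax is purely
hypothetical, since no such EXPLAIN flag exists at this point in the thread,
and "regress" is just one of the names floated above:

-- Hypothetical flag: one switch that suppresses all platform-dependent
-- details (memory, disk, timing) instead of masking them field by field.
explain (analyze, costs off, regress)
select * from (select * from t order by a) s order by a, b limit 55;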

#201James Coleman
jtc331@gmail.com
In reply to: Justin Pryzby (#196)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Mar 12, 2020 at 7:40 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Thanks for working on this. I have some minor comments.

In 0005:

+ /* Restore the input path (we might have addes Sort on top). */

=> added? There's at least two more of the same typo.

Fixed.

+ /* also ignore already sorted paths */

=> You say that in a couple places, but I don't think "also" makes sense since
there's nothing preceding it ?

Updated.

In 0004:

+ * end up resorting the entire data set. So, unless we can push

=> re-sorting

Fixed in this patch; that also shows up in
contrib/postgres_fdw/postgres_fdw.c, but I'll leave that alone.

+ * Unlike generate_gather_paths, this does not look just as pathkeys of the

=> look just AT ?

Fixed.

+ /* now we know is_sorted == false */

=> I would just spell that "Assert", as I think you already do elsewhere.

+ /* continue */

=> Please consider saying "fall through", since "continue" means exactly the
opposite.

Updated.

+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
...
+                       /* finally, consider incremental sort */
...
+                               /* Also consider incremental sort. */

=> I think it's more confusing than useful with two comments - one is adequate.

Also fixed.

In 0002:

+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
...
+ * make_incrementalsort --- basic routine to build a IncrementalSort plan node

=> AN incremental

Fixed.

+ * Initial size of memtuples array.  We're trying to select this size so that
+ * array don't exceed ALLOCSET_SEPARATE_THRESHOLD and overhead of allocation
+ * be possible less.  However, we don't cosider array sizes less than 1024

Four typos (?)
that array DOESN'T
and THE overhead
CONSIDER
I'm not sure, but "be possible less" should maybe say "possibly be less" ?

Fixed.

+ bool maxSpaceOnDisk; /* true when maxSpace is value for on-disk

I suggest calling it IsMaxSpaceDisk

Changed, though with lowercase 'I' (let me know if using uppercase is
standard here).

+       MemoryContext maincontext;      /* memory context for tuple sort metadata
+                                          that persist across multiple batches */

persists

Fixed.

+ *     a new sort.  It allows evade recreation of tuple sort (and save resources)
+ *     when sorting multiple small batches.

allows to avoid? Or allows avoiding?

Fixed.

+ *      When performing sorting by multiple keys input dataset could be already
+ *      presorted by some prefix of these keys.  We call them "presorted keys".

"already presorted" sounds redundant

Reworded.

+       int64           fullsort_group_count;   /* number of groups with equal presorted keys */
+       int64           prefixsort_group_count; /* number of groups with equal presorted keys */

I guess these should have different comments

The structure of that changed in my patch from a few days ago, I
believe, so there aren't two fields anymore. Are you reviewing the
current patch?

Thanks,
James

Attachments:

v37-0004-A-couple-more-places-for-incremental-sort.patchapplication/octet-stream; name=v37-0004-A-couple-more-places-for-incremental-sort.patchDownload
From c630ce9001c50c29d537af97b104fc4bcada5df8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v37 4/4] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/path/allpaths.c  |  11 +-
 src/backend/optimizer/plan/planner.c   | 220 ++++++++++++++++++++++++-
 3 files changed, 222 insertions(+), 11 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e0bb71dd51..3ebe1c3262 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2789,7 +2789,7 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			 * The planner and executor don't have any clever strategy for
 			 * taking data sorted by a prefix of the query's pathkeys and
 			 * getting it to be sorted by all of those pathkeys. We'll just
-			 * end up resorting the entire data set.  So, unless we can push
+			 * end up re-sorting the entire data set.  So, unless we can push
 			 * down all of the query pathkeys, forget it.
 			 *
 			 * is_foreign_expr would detect volatile expressions as well, but
@@ -2819,7 +2819,7 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
  *		Generate parallel access paths for a relation by pushing a Gather or
  *		Gather Merge on top of a partial path.
  *
- * Unlike generate_gather_paths, this does not look just as pathkeys of the
+ * Unlike generate_gather_paths, this does not look only at pathkeys of the
  * input paths (aiming to preserve the ordering). It also considers ordering
  * that might be useful by nodes above the gather merge node, and tries to
  * add a sort (regular or incremental) to provide that.
@@ -2889,7 +2889,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 				continue;
 			}
 
-			/* now we know is_sorted == false */
+			Assert(!is_sorted);
 
 			/*
 			 * consider regular sort for cheapest partial path (for each
@@ -2916,15 +2916,14 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 
 				add_path(rel, &path->path);
 
-				/* continue */
+				/* Fall through */
 			}
 
-			/* finally, consider incremental sort */
+			/* Also consider incremental sort. */
 			if (presorted_keys > 0)
 			{
 				Path	   *tmp;
 
-				/* Also consider incremental sort. */
 				tmp = (Path *) create_incremental_sort_path(root,
 															rel,
 															subpath,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4411fc515a..b92b65b543 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5070,6 +5070,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6570,12 +6631,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6606,6 +6673,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6875,6 +6992,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6939,10 +7110,10 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
+			/* Since we have presorted keys, consider incremental sort. */
 			path = (Path *) create_incremental_sort_path(root,
 														 partially_grouped_rel,
 														 path,
@@ -7067,10 +7238,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7096,6 +7268,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7197,7 +7409,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.20.1 (Apple Git-117)
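
One of the planner places 0004 covers is sorted grouping over partially
sorted input. A sketch of the kind of query that benefits (table and column
names are illustrative; whether the planner actually picks the incremental
sort depends on costing and parallel settings):

-- With an index on (a), GROUP BY a, b can run an incremental sort
-- (presorted on a) under a sorted aggregate instead of a full sort, and
-- under parallelism the per-worker streams are combined by a Gather Merge.
create table g (a int, b int);
create index on g (a);
insert into g select i % 50, (i * 31) % 1000 from generate_series(1, 1000000) i;
analyze g;
explain (costs off)
select a, b, count(*) from g group by a, b;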

v37-0001-Consider-low-startup-cost-when-adding-partial-pa.patchapplication/octet-stream; name=v37-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 5252de9888e9e676d8dcb8efa840199633b85d9d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v37 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: as a result, a higher cost plan ends
up being chosen, because a low startup cost partial path is ignored in
favor of a lower total cost partial path, and a limit applied on top of
that would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.20.1 (Apple Git-117)
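
The scenario the 0001 commit message describes can be sketched as follows
(names are illustrative; whether the cheap-startup plan wins still depends
on costing):

-- With an index on (a), "ORDER BY a, b LIMIT 10" favors a partial path
-- that is cheap to start (index scan + incremental sort) over one with a
-- lower total cost (seq scan + full sort).  Comparing partial paths on
-- total cost alone can throw the winning path away before the LIMIT is
-- ever taken into account.
create table big (a int, b int);
create index on big (a);
insert into big select i % 100, i from generate_series(1, 1000000) i;
analyze big;
explain select * from big order by a, b limit 10;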

v37-0003-Consider-incremental-sort-paths-in-additional-pl.patchapplication/octet-stream; name=v37-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 61615afd40f2359e58cf4f42212d73e31532e88a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v37 3/4] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 222 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 351 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..e0bb71dd51 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,224 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up resorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look just as pathkeys of the
+ * input paths (aiming to preserve the ordering). It also considers ordering
+ * that might be useful by nodes above the gather merge node, and tries to
+ * add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/* now we know is_sorted == false */
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* continue */
+			}
+
+			/* finally, consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/* Also consider incremental sort. */
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3117,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 55fe2a935c..4411fc515a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6424,7 +6424,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6483,6 +6485,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6807,7 +6883,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6842,6 +6920,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7223,7 +7351,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.20.1 (Apple Git-117)

v37-0002-Implement-incremental-sort.patchapplication/octet-stream; name=v37-0002-Implement-incremental-sort.patchDownload
From c5a78087d545a2fca2c30c032f0577235241fd0e Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v37 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
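
As a sketch of the intended behavior (table and index names are
hypothetical), a query like the following can benefit, assuming an
index on (a) provides input presorted by the leading sort key:

    CREATE TABLE t (a int, b int);
    CREATE INDEX ON t (a);
    -- Sort each group of equal "a" values by "b" instead of
    -- sorting the whole table:
    EXPLAIN SELECT * FROM t ORDER BY a, b LIMIT 10;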

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1189 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   73 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  195 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   78 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3777 insertions(+), 125 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb..64ea00f462 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..e73038b0cd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, sortMethodName);
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense since the same sort instrument can have been used
+			 * multiple times, so its most recent use still being in
+			 * progress doesn't seem relevant. Instead I'm now checking
+			 * whether the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign
+		 * this, it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..1eba11bb0c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1189 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
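+ *
+ * As an illustration (hypothetical values): if the full sort tuplesort
+ * holds tuples with prefix key values 1, 1, 2, 2, 2, the first call moves
+ * the two prefix-1 tuples into the prefix sort tuplesort and sorts them;
+ * once those tuples have been read out, a subsequent call (driven by
+ * n_fullsort_remaining) handles the prefix-2 group.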
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * We retain the current group pivot tuple since we haven't yet
+		 * found the end of the current prefix key group.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the parent
+		 * node read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
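+/*
+ * As a worked example of these heuristics (using the default values
+ * above): we accumulate 32 tuples before doing any prefix key checks at
+ * all, and if we then exceed 64 tuples (2 * 32) without observing a
+ * prefix key change we assume we've found a large group and switch to
+ * presorted prefix mode.
+ */
+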
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (node->finished ||
+			tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL))
+			/*
+			 * TODO: there isn't a good test case for the node->finished case
+			 * directly, but lots of other stuff fails if it's not there. If
+			 * the outer node will fail when trying to fetch too many tuples,
+			 * then things break if that test isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() pulled one of those groups out,
+			 * and we've returned those tuples to the parent node, but if we
+			 * have tuples remaining in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building new full sort group.
+			 * building a new full sort group.
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If the node is bounded, calculate the number of tuples still to be
+		 * returned and configure both bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group on subsequent groups we have to
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (which wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to the parent plan node. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the number of tuples processed so far for use
+			 * in configuring the next sort bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only hold the
+	 * current prefix-key group in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->finished = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Also must re-sort if the
+	 * bounded-sort parameters changed or we didn't select randomAccess.
+	 *
+	 * Otherwise we can just rewind and rescan the sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..f73d0782f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and here we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and inflate
+	 * the average group size by half (hence the 1.5 multiplier below).
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
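+
+	/*
+	 * To illustrate with hypothetical numbers: for input_tuples = 100000
+	 * split into input_groups = 100, each per-group tuplesort above is
+	 * costed for 1500 tuples (1.5 * group_tuples), yet startup_cost covers
+	 * only the first group plus the input's startup cost, which is what
+	 * makes this path attractive under LIMIT.
+	 */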
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
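+ *
+ *    For example (hypothetical keys): given keys1 = (a, b, c) and
+ *    keys2 = (a, b), *n_common is set to 2 and false is returned,
+ *    since keys1 is not fully contained in keys2.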
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none. Any
+	 * leading common pathkeys could be useful for ordering because we can use
+	 * leading common pathkeys could be useful for ordering because we can
+	 * use incremental sort.
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..55fe2a935c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4868,7 +4868,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..11e6fce9d1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..4949ef2079 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..c4eb90c196 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * overhead of allocation might possibly be lowered.  However, we don't
+ * consider array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
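+/*
+ * For example (assuming a 64-bit build where sizeof(SortTuple) is 24 bytes
+ * and ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes): 8192 / 24 + 1 = 342, so
+ * INITIAL_MEMTUPSIZE resolves to Max(1024, 342) = 1024 entries.
+ */
+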
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +665,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +702,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
@@ -691,13 +712,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
 	/*
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
@@ -715,7 +744,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +769,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tuplecontext = tuplecontext;
+	state->maincontext = maincontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -748,9 +778,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
 	state->growmemtuples = true;
 	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +842,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +918,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1013,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1091,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1134,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1251,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1324,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
 	 * including the Tuplesortstate struct itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't manage to fit the data in
+	 * main memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, due to a
+	 * more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
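+ *
+ *	A typical calling pattern is roughly (sketch):
+ *		tuplesort_puttupleslot() for each tuple of a batch;
+ *		tuplesort_performsort();
+ *		tuplesort_gettupleslot() until it returns false;
+ *		tuplesort_reset(), then continue with the next batch.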
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+	state->availMem = state->allowedMem;
+	state->lastReturnedTuple = NULL;
+	state->slabAllocatorUsed = false;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
+	state->growmemtuples = true;
+
+	if (state->memtupsize < INITIAL_MEMTUPSIZE)
+	{
+		if (state->memtuples)
+			pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2725,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2775,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3272,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..9ef4407ead 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,69 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;		/* number of groups sorted so far */
+	long		maxDiskSpaceUsed;	/* maximum disk space used by any group */
+	long		totalDiskSpaceUsed; /* total disk space across all groups */
+	long		maxMemorySpaceUsed; /* maximum memory used by any group */
+	long		totalMemorySpaceUsed;	/* total memory across all groups */
+	List	   *sortMethods;	/* list of sort methods used */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,		/* loading tuples into the full-sort state */
+	INCSORT_LOADPREFIXSORT,		/* loading tuples into the prefix-sort state */
+	INCSORT_READFULLSORT,		/* returning tuples from the full-sort state */
+	INCSORT_READPREFIXSORT,		/* returning tuples from the prefix-sort state */
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from outer node is
+								 * finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;	/* tuples left to read from fullsort_state */
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..34f18bd73a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..7892b111d7
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 26kB (avg), 30kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 26,       +
+             "Maximum Sort Space Used": 30,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.20.1 (Apple Git-117)

#202Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#198)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 01:16:44PM -0400, James Coleman wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...

Now, a couple comments about parts 0001 - 0003 of the patch ...

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-               "Sort Space Used": 30,                             +
+               "Sort Space Used": 29,                             +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.

make check passes on multiple machines for me; what arch/distro are you using?

Nothing exotic - Fedora on x64. My guess is that things like enabling
asserts will make a difference too, because then we track additional
stuff for allocated chunks etc. So I agree with Tom that trying to keep
this stable is a lost cause, essentially.
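
If we want stable output from these tests anyway, one option (just a
sketch, not something in the attached patch - the function name here is
made up) is to launder the volatile numbers through a plpgsql wrapper,
the way some other regression tests filter EXPLAIN output:

CREATE FUNCTION explain_without_memory(query text) RETURNS SETOF text
LANGUAGE plpgsql AS
$$
DECLARE
    ln text;
BEGIN
    FOR ln IN
        EXECUTE format('EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) %s',
                       query)
    LOOP
        -- replace any "NNkB" memory/disk figure with a stable placeholder
        ln := regexp_replace(ln, '\d+kB', 'NkB', 'g');
        RETURN NEXT ln;
    END LOOP;
END;
$$;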

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#203James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#194)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats doesn't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage I got
the attached assertion failure.

James

Attachments:

assertion_bt.txttext/plain; charset=US-ASCII; name=assertion_bt.txtDownload
#204Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: James Coleman (#197)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-13, James Coleman wrote:

I don't propose to commit 0003 of course, since it's not our policy;
that's just to allow running pgindent sanely, which gives you 0004
(though my local pgindent has an unrelated fix). And after that you
notice the issue that 0005 fixes.

Is there a page on how you're supposed to run pgindent/when stuff like
this does get added/etc.? It's all a big mystery to me right now.

Also, I noticed some of the pgindent changes aren't for changes in
this patch series; I have that as a separate patch, but not attached
because I see that running pgindent locally generates a massive patch,
so I'm assuming we just ignore those for now?

Ah, I should have paid more attention to what I was attaching. Yeah,
ideally you run pgindent and then only include the changes that are
relevant to your patch series. We run pgindent across the whole tree
once every release, so about yearly. Some commits go in that are not
indented correctly, and those bother everyone -- I'm guilty of this
myself more frequently than I'd like.

You can specify a filelist to pgindent, also. What I do is super
low-tech: do a "git diff origin/master", copy the filelist, and then
^V^E to paste that list into a command line to run pgindent (editing to
remove the change histogram and irrelevant files). I should automate
this ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#205Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#204)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

... You can specify a filelist to pgindent, also. What I do is super
low-tech: do a "git diff origin/master", copy the filelist, and then
^V^E to paste that list into a command line to run pgindent (editing to
remove the change histogram and irrelevant files). I should automate
this ...

Yeah. I tend to keep copies of the files I'm specifically hacking on
in a separate work directory, and then I re-indent just that directory.
But that's far from ideal as well. I wonder if it'd be worth teaching
pgindent to have some option to indent only files that are already
modified according to git?

regards, tom lane

#206Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#205)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-13, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

... You can specify a filelist to pgindent, also. What I do is super
low-tech: do a "git diff origin/master", copy the filelist, and then
^V^E to paste that list into a command line to run pgindent (editing to
remove the change histogram and irrelevant files). I should automate
this ...

Yeah. I tend to keep copies of the files I'm specifically hacking on
in a separate work directory, and then I re-indent just that directory.
But that's far from ideal as well. I wonder if it'd be worth teaching
pgindent to have some option to indent only files that are already
modified according to git?

A quick look at git-ls-files manpage suggests that this might work:

src/tools/pgindent/pgindent $(git ls-files --modified -- *.[ch])

If it's that easy, maybe it's not worth messing with pgindent ...

Also, I wonder if it would be better to modify our policies so that we
update typedefs.list more frequently. Some people include additions
with their commits, but it's far from SOP.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#207Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#206)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Also, I wonder if it would be better to modify our policies so that we
update typedefs.list more frequently. Some people include additions
with their commits, but it's far from SOP.

Perhaps. My own workflow includes pulling down a fresh typedefs.list
from the buildfarm (which is trivial to automate) and then adding any
typedefs invented by the patch I'm working on. The latter part of it
makes it hard to see how the in-tree list would be very helpful; and
if we started expecting patches to include typedef updates, I'm afraid
we'd get lots of patch collisions in that file.
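
For the record, the fetch part really is a one-liner (this assumes the
buildfarm's usual typedefs dump URL; the appended name is just an example
from this patch):

curl -o src/tools/pgindent/typedefs.list \
    https://buildfarm.postgresql.org/cgi-bin/typedefs.pl
echo "IncrementalSortState" >> src/tools/pgindent/typedefs.list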

I don't have any big objection to updating the in-tree list more often,
but personally I wouldn't use it, unless we can find a better workflow.

regards, tom lane

#208James Coleman
jtc331@gmail.com
In reply to: James Coleman (#203)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats doesn't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

James

Attachments:

assertion_bt.txttext/plain; charset=US-ASCII; name=assertion_bt.txtDownload
#209Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#208)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats doesn't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?
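
Something along these lines should work (the pid, core path, and frame
number are just placeholders - pick whichever frame the Assert fired in):

gdb -p <backend pid>        # or: gdb /path/to/postgres /path/to/core
(gdb) bt                    # locate the frame with the failing Assert
(gdb) frame 2               # placeholder, adjust to that frame
(gdb) print total_allocated
(gdb) print context->mem_allocated
(gdb) print context->name
(gdb) call MemoryContextStats(TopMemoryContext)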

BTW, I might have copied the wrong query - can you try with a higher
value in the LIMIT clause? For example:

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 1000000;

I think this might be the difference why you don't see the memory leak.
Or maybe it was because of asserts? I'm not sure if I had enabled them
in the build ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#210James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#209)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats

don't

show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

BTW, I might have copied the wrong query - can you try with a higher
value in the LIMIT clause? For example:

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 1000000;

I think this might be the difference in why you don't see the memory leak.
Or maybe it was because of asserts? I'm not sure if I had enabled them
in the build ...

I’m not at my laptop right now; I’ve started looking at it but haven’t
figured it out yet. Going from memory, it had allocated 16384 but
expected 8192 (I think I have the order of that right).

It’s very consistently reproducible, thankfully, but doesn’t always
happen on the first query; IIRC it’s always the 2nd with LIMIT 100, and
I could get it to happen with the first at 96 and the second at 97, but
repeating 96 many times didn’t seem to trigger it.

I’m hoping it’s the same root cause as the memory leak, but unsure.

I’ll try a higher number when I get a chance.

James

#211James Coleman
jtc331@gmail.com
In reply to: James Coleman (#210)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats don't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

I've reproduced this on multiple machines (though all are Ubuntu or
Debian derivatives...I don't think that's likely to matter). A core
dump is ~150MB, so I've uploaded to Dropbox [1].

I didn't find an obvious first-level member of Tuplesortstate that was
covered by either of the two blocks in the AllocSet (both are 8KB in
size).

James

[1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0

#212James Coleman
jtc331@gmail.com
In reply to: James Coleman (#211)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:

On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats don't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

I've reproduced this on multiple machines (though all are Ubuntu or
Debian derivatives...I don't think that's likely to matter). A core
dump is ~150MB, so I've uploaded to Dropbox [1].

I didn't find an obvious first-level member of Tuplesortstate that was
covered by either of the two blocks in the AllocSet (both are 8KB in
size).

James

[1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0

And... I think I might have found the issue (though I haven't proved
it 100% yet or fixed it):

The incremental sort node calls `tuplesort_puttupleslot`, which
switches the memory context to `sortcontext`. It then calls
`puttuple_common`. `puttuple_common` may then call `grow_memtuples`
which reallocs space for `sortstate->memtuples`, but `memtuples` is
elsewhere allocated in the memory context maincontext.
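
For reference, the call pattern is roughly this (simplified from
tuplesort.c; comments and details elided):

    void
    tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
    {
        MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
        SortTuple   stup;

        COPYTUP(state, &stup, (void *) slot);   /* copy the caller's tuple */
        puttuple_common(state, &stup);          /* may call grow_memtuples() */

        MemoryContextSwitchTo(oldcontext);
    }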

Earlier in this debugging process I had noticed that `sortcontext` was
allocated in `maincontext`, which seemed conceptually odd if our goal
is to reuse the sort state. I also found a comment that needed to be
changed relative to cleaning up the per-sort context (it talks about
freeing the sort state itself). But the `memtuples` array was in fact
freed at reset as well, so it seemed safe.

Given this issue though, I think I'm going to go ahead and rework so
that the `memtuples` array lies within the `sortcontext` instead.

James

#213James Coleman
jtc331@gmail.com
In reply to: James Coleman (#212)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 12:24 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:

On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats don't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

I've reproduced this on multiple machines (though all are Ubuntu or
Debian derivatives...I don't think that's likely to matter). A core
dump is ~150MB, so I've uploaded to Dropbox [1].

I didn't find an obvious first-level member of Tuplesortstate that was
covered by either of the two blocks in the AllocSet (both are 8KB in
size).

James

[1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0

And... I think I might have found the issue (though I haven't proved
it 100% yet or fixed it):

The incremental sort node calls `tuplesort_puttupleslot`, which
switches the memory context to `sortcontext`. It then calls
`puttuple_common`. `puttuple_common` may then call `grow_memtuples`
which reallocs space for `sortstate->memtuples`, but `memtuples` is
elsewhere allocated in the memory context maincontext.

Earlier in this debugging process I had noticed that `sortcontext` was
allocated in `maincontext`, which seemed conceptually odd if our goal
is to reuse the sort state. I also found a comment that needed to be
changed relative to cleaning up the per-sort context (it talks about
freeing the sort state itself). But the `memtuples` array was in fact
freed at reset as well, so it seemed safe.

Given this issue though, I think I'm going to go ahead and rework so
that the `memtuples` array lies within the `sortcontext` instead.

Perhaps I spoke too soon: I didn't realize repalloc(_huge) didn't need
a memory context switch, so this likely isn't the issue.
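
To spell that out: repalloc resizes a chunk within whatever context the
chunk was originally allocated in, regardless of CurrentMemoryContext,
so no switch is needed before growing memtuples. A standalone sketch
(the "demo" context name is made up for illustration):

    MemoryContext demo = AllocSetContextCreate(CurrentMemoryContext,
                                               "demo",
                                               ALLOCSET_DEFAULT_SIZES);
    char       *p = MemoryContextAlloc(demo, 100);
    MemoryContext old = MemoryContextSwitchTo(TopMemoryContext);

    p = repalloc(p, 8192);      /* still accounted to "demo", not to
                                 * TopMemoryContext */

    MemoryContextSwitchTo(old);
    MemoryContextDelete(demo);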

James

#214James Coleman
jtc331@gmail.com
In reply to: James Coleman (#213)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 12:36 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Mar 14, 2020 at 12:24 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:

On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

3) Most of the execution plans look reasonable, except that some of the
plans look like this:

QUERY PLAN
---------------------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Incremental Sort
Sort Key: t.a, t.b, t.c, t.d
Presorted Key: t.a, t.b, t.c
-> Incremental Sort
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
-> Index Scan using t_a_b_idx on t
(10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

SELECT a, b, c, d, count(*) FROM (
SELECT * FROM t ORDER BY a, b, c
) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

QUERY PLAN
---------------------------------------------
Limit
-> GroupAggregate
Group Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c, t.d
-> Sort
Sort Key: t.a, t.b, t.c
-> Seq Scan on t
(8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats don't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
INSERT INTO t SELECT
1000*random(), 1000*random(), 1000*random(), 1000*random()
FROM generate_series(1,10000000) s(i);
CREATE INDEX idx ON t(a,b);
ANALYZE t;

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

I've reproduced this on multiple machines (though all are Ubuntu or
Debian derivatives...I don't think that's likely to matter). A core
dump is ~150MB, so I've uploaded to Dropbox [1].

I didn't find an obvious first-level member of Tuplesortstate that was
covered by either of the two blocks in the AllocSet (both are 8KB in
size).

James

[1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0

And... I think I might have found the issue (though I haven't proved
it 100% yet or fixed it):

The incremental sort node calls `tuplesort_puttupleslot`, which
switches the memory context to `sortcontext`. It then calls
`puttuple_common`. `puttuple_common` may then call `grow_memtuples`
which reallocs space for `sortstate->memtuples`, but `memtuples` is
elsewhere allocated in the memory context maincontext.

Earlier in this debugging process I had noticed that `sortcontext` was
allocated in `maincontext`, which seemed conceptually odd if our goal
is to reuse the sort state. I also found a comment that needed to be
changed relative to cleaning up the per-sort context (it talks about
freeing the sort state itself). But the `memtuples` array was in fact
freed at reset as well, so it seemed safe.

Given this issue though, I think I'm going to go ahead and rework so
that the `memtuples` array lies within the `sortcontext` instead.

Perhaps I spoke too soon: I didn't realize repalloc(_huge) didn't need
a memory context switch, so this likely isn't the issue.

It looks like the issue is actually in the `tuplecontext`, which is
currently a child context of `sortcontext`:

#3 0x0000558cd153b565 in AllocSetCheck
(context=context@entry=0x558cd28e0b70) at aset.c:1573
1573 Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 16384
(gdb) p context->mem_allocated
$2 = 8192
(gdb) p context->name
$3 = 0x558cd16c8ccd "Caller tuples"

I stuck in several more AllocSetCheck calls in aset.c and got the
attached backtrace.
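
That is, extra checks of this form sprinkled through the AllocSet*
routines in aset.c (the exact call sites were just local
instrumentation, not part of the patch):

    #ifdef MEMORY_CONTEXT_CHECKING
        AllocSetCheck(context);
    #endif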

James

Attachments:

updated_aset_bt.txt (text/plain)
#215Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#214)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 02:41:09PM -0400, James Coleman wrote:

It looks like the issue is actually in the `tuplecontext`, which is
currently a child context of `sortcontext`:

#3 0x0000558cd153b565 in AllocSetCheck
(context=context@entry=0x558cd28e0b70) at aset.c:1573
1573 Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 16384
(gdb) p context->mem_allocated
$2 = 8192
(gdb) p context->name
$3 = 0x558cd16c8ccd "Caller tuples"

I stuck in several more AllocSetCheck calls in aset.c and got the
attached backtrace.

I think the problem is pretty simple - tuplesort_reset resets the
sortcontext. But that *deletes* the tuplecontext, so
state->tuplecontext gets stale. I haven't looked into the exact
details, but it clearly confuses the accounting.

The attached patch fixes the issue for me - I'm not claiming it's the
right fix, but it's the simplest thing I could think of. Maybe
tuplesort_reset should work differently, not sure.

And it seems to resolve the memory leak too - I suspect we've freed the
context (so it was not part of the tree of contexts) but the struct was
still valid and we kept allocating memory in it - but it was invisible
to MemoryContextStats etc.
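
Schematically, the hazard is this (a minimal sketch, not the actual
tuplesort.c code; the context names are illustrative):

    MemoryContext sortcontext = AllocSetContextCreate(CurrentMemoryContext,
                                                      "sort context",
                                                      ALLOCSET_DEFAULT_SIZES);
    MemoryContext tuplecontext = AllocSetContextCreate(sortcontext,
                                                       "Caller tuples",
                                                       ALLOCSET_DEFAULT_SIZES);

    MemoryContextReset(sortcontext);    /* resets sortcontext and *deletes*
                                         * its children, tuplecontext
                                         * included */

    /* tuplecontext now points at freed memory: allocating through it can
     * appear to work, but the context is no longer in the context tree, so
     * its memory is invisible to MemoryContextStats and the accounting no
     * longer adds up */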

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

tuplesort-fix.patch (text/plain)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c2bd38f39f..91c0189577 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1419,6 +1419,10 @@ tuplesort_reset(Tuplesortstate *state)
 	state->slabFreeHead = NULL;
 	state->growmemtuples = true;
 
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+										 "Caller tuples",
+										 ALLOCSET_DEFAULT_SIZES);
+
 	if (state->memtupsize < INITIAL_MEMTUPSIZE)
 	{
 		if (state->memtuples)
#216James Coleman
jtc331@gmail.com
In reply to: James Coleman (#197)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 1:06 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I gave this a very quick look; I don't claim to understand it or
anything, but I thought these trivial cleanups worthwhile. The only
non-cosmetic thing is changing order of arguments to the SOn_printf()
calls in 0008; I think they are contrary to what the comment says.

Yes, I think you're correct (re: 0008).

They all look generally good to me, and are included in the attached
patch series.

I just realized something about this (unsure if in Alvaro's changes or
in my applying them) broke make check pretty decently (3 test files
broken, also much slower, and the incremental sort test returns a lot
of obviously broken results).

I'll take a look tomorrow and hopefully get a fix (probably will reply
to the more recent subthreads though).

James

#217James Coleman
jtc331@gmail.com
In reply to: James Coleman (#216)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 10:55 PM James Coleman <jtc331@gmail.com> wrote:

On Fri, Mar 13, 2020 at 1:06 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I gave this a very quick look; I don't claim to understand it or
anything, but I thought these trivial cleanups worthwhile. The only
non-cosmetic thing is changing order of arguments to the SOn_printf()
calls in 0008; I think they are contrary to what the comment says.

Yes, I think you're correct (re: 0008).

They all look generally good to me, and are included in the attached
patch series.

I just realized something about this (unsure if in Alvaro's changes or
in my applying them) broke make check pretty decently (3 test files
broken, also much slower, and the incremental sort test returns a lot
of obviously broken results).

I'll take a look tomorrow and hopefully get a fix (probably will reply
to the more recent subthreads though).

This took a bit of manually excluding changes until I got a red/green
set, because nothing in the patch set looked at all incorrect. But it
turns out this change breaks things:

- if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
-                            false, slot, NULL) || node->finished)
+ if (node->finished ||
+     tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+                            false, slot, NULL))

I believe what's happening here is that we need tuplesort_gettupleslot
to set the slot to a NULL tuple (if there aren't any left) before we
return the slot. But it returns false in that case, so the
node->finished check is there to ensure that the first time the method
nulls out the slot and returns false, we still return the value in the
slot.
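
The underlying gotcha is just C's short-circuit evaluation: reordering
the operands skips the call that has the side effect. A self-contained
toy (nothing below is from the actual patch):

    #include <stdbool.h>
    #include <stdio.h>

    /* stands in for tuplesort_gettupleslot: clears the slot and returns
     * false once no tuples are left */
    static bool
    get_tuple(int *slot)
    {
        *slot = 0;
        return false;
    }

    int
    main(void)
    {
        int         slot = 42;      /* stale value from the previous fetch */
        bool        finished = true;

        if (finished || get_tuple(&slot))   /* get_tuple() never runs */
            printf("slot = %d\n", slot);    /* prints 42, not 0 */
        return 0;
    }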

Since this isn't obvious, I'll add a comment. I think I'm also going
to rename node->finished to be more clear.

In this debugging I also noticed that we don't set node->finished back
to false in rescan, which I assume is a bug, but I don't really
understand a whole lot about rescan. IIRC from some previous
discussions rescan exists for things like cursors, where you can move
back and forth over the result set. Assuming that's the case, do we
need explicit tests for cursors using incremental sort? Is there a
good strategy for how much to do there (since I don't want to
duplicate every non-cursor functional test).

Thanks,
James

#218James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#207)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 13, 2020 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Also, I wonder if it would be better to modify our policies so that we
update typedefs.list more frequently. Some people include additions
with their commits, but it's far from SOP.

Perhaps. My own workflow includes pulling down a fresh typedefs.list
from the buildfarm (which is trivial to automate) and then adding any
typedefs invented by the patch I'm working on. The latter part of it
makes it hard to see how the in-tree list would be very helpful; and
if we started expecting patches to include typedef updates, I'm afraid
we'd get lots of patch collisions in that file.

I don't have any big objection to updating the in-tree list more often,
but personally I wouldn't use it, unless we can find a better workflow.

How does the buildfarm automate generating the typedefs list? Would it
be relatively easy to incorporate that into a tool that someone could
use locally with pgindent?

Thanks,
James

#219James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#215)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 14, 2020 at 3:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 14, 2020 at 02:41:09PM -0400, James Coleman wrote:

It looks like the issue is actually in the `tuplecontext`, which is
currently a child context of `sortcontext`:

#3 0x0000558cd153b565 in AllocSetCheck
(context=context@entry=0x558cd28e0b70) at aset.c:1573
1573 Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 16384
(gdb) p context->mem_allocated
$2 = 8192
(gdb) p context->name
$3 = 0x558cd16c8ccd "Caller tuples"

I stuck in several more AllocSetCheck calls in aset.c and got the
attached backtrace.

I think the problem is pretty simple - tuplesort_reset resets the
sortcontext. But that *deletes* the tuplecontext, so
state->tuplecontext gets stale. I haven't looked into the exact
details, but it clearly confuses the accounting.

The attached patch fixes the issue for me - I'm not claiming it's the
right fix, but it's the simplest thing I could think of. Maybe
tuplesort_reset should work differently, not sure.

And it seems to resolve the memory leak too - I suspect we've freed the
context (so it was not part of the tree of contexts) but the struct was
still valid and we kept allocating memory in it - but it was invisible
to MemoryContextStats etc.

Thanks for tracking that down!

I wondered at first if we should consider making the tuplecontext a
child context of the main context instead, but the allocations it
contains will always live at most as long as the contents of the
sortcontext, so I think that is probably fine as is.

This issue, and the resulting fix, did however make me think that we
have some duplication here that's likely to lead to bugs now and/or
later. I've reworked things a bit so that both tuplesort_reset (added
for incremental sort) and the existing tuplesort_begin_common now call
out to tuplesort_begin_batch to configure initial starting state. This
way it should be more obvious that there are both cases to consider if
new initialization code is being added to tuplesort.c.
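
Reduced to a standalone toy, the shape of the refactor looks like this
(only the tuplesort_begin_batch name comes from the actual patch; the
rest is schematic):

    #include <stdio.h>

    typedef struct SortState
    {
        int     batchno;
        int     ntuples;
    } SortState;

    /* the single per-batch initializer both entry points call, so new
     * initialization code only has to be added in one place */
    static void
    begin_batch(SortState *state)
    {
        state->ntuples = 0;
        state->batchno++;
    }

    /* analogue of tuplesort_begin_common: one-time setup plus first batch */
    static void
    begin_common(SortState *state)
    {
        state->batchno = 0;
        begin_batch(state);
    }

    /* analogue of tuplesort_reset: prepare for the next batch */
    static void
    reset(SortState *state)
    {
        begin_batch(state);
    }

    int
    main(void)
    {
        SortState   s;

        begin_common(&s);   /* batch 1 */
        reset(&s);          /* batch 2 */
        printf("batch %d, ntuples %d\n", s.batchno, s.ntuples);
        return 0;
    }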

While working on this refactor I noticed that we seemed to be
reallocating the memtuples array only if it was smaller than the
initial size (not sure that's even possible?), which means that if it
had been resized significantly we were keeping that memory tied up for
subsequent batches even when not needed. I suppose one could argue
that's helpful in the sense that we don't need to keep increasing its
size on each batch... but it doesn't seem very clean to me, so I've
changed that.

I also noticed that we had the USEMEM macro used in tuplesort_reset
regardless of whether or not the memtuples array had been reallocated,
which seems wrong (I don't think we were resetting the stats), so I
changed that too.

I'm still not sure if we should keep the memtuples array in
maincontext (where it is now) or sortcontext. The only argument I can
see for the former is that it allows us a minor optimization: when we
haven't grown the array, we can reuse it for multiple batches. On the
other hand, always resetting it is conceptually clearer. I'm curious
to hear your thoughts on this.

Over at [1] I'd noticed that the patch series versions since my
incorporating Alvaro's pgindent/general formatting patches had been
failing make check. I noted there that I'd found the problem and was
fixing it, so this version of the patch includes that fix. We still
(from that sub-thread) need to figure out what we may or may not need
to update and test with rescan.

James

[1]: /messages/by-id/CAAaqYe9BsrW=DRBOd9yW0s2djofXTM9mRpO=LhHrCu4qdGgrVg@mail.gmail.com

Attachments:

v38-0004-A-couple-more-places-for-incremental-sort.patch (text/x-patch)
From d05d5f2e545a9257b32a46544e1559cbbaab3ff9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v38 4/4] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 220 ++++++++++++++++++++++++-
 2 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4411fc515a..b92b65b543 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5070,6 +5070,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6570,12 +6631,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6606,6 +6673,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, not point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6875,6 +6992,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6939,10 +7110,10 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
+			/* Since we have presorted keys, consider incremental sort. */
 			path = (Path *) create_incremental_sort_path(root,
 														 partially_grouped_rel,
 														 path,
@@ -7067,10 +7238,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7096,6 +7268,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7197,7 +7409,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

v38-0001-Consider-low-startup-cost-when-adding-partial-pa.patch (text/x-patch)
From 5252de9888e9e676d8dcb8efa840199633b85d9d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v38 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..3e836e6e1c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v38-0003-Consider-incremental-sort-paths-in-additional-pl.patch (text/x-patch)
From 2baea1f0f11bf6a188cf306c77ebe91639aff53a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v38 3/4] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 221 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 350 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..9a92948fe3 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,223 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up re-sorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look only at pathkeys of the
+ * input paths (aiming to preserve the ordering). It also considers ordering
+ * that might be useful by nodes above the gather merge node, and tries to
+ * add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/* Also consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3116,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 55fe2a935c..4411fc515a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6424,7 +6424,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6483,6 +6485,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6807,7 +6883,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6842,6 +6920,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7223,7 +7351,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.17.1

v38-0002-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 9a5b2c7e9ce3a893573bb9946903f444c99cf9f0 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 27 Sep 2019 19:36:53 +0000
Subject: [PATCH v38 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
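
For example (hypothetical schema): given a btree index on (a), a query
such as "SELECT * FROM tbl ORDER BY a, b LIMIT 10" can use the index to
produce rows already ordered by (a) and an incremental sort to order
each group of equal "a" values by (b), instead of sorting the whole
table.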

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1201 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   73 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  296 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   77 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3852 insertions(+), 162 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 371d7838fb..64ea00f462 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4490,6 +4490,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..e73038b0cd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
@@ -1239,6 +1243,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1897,6 +1904,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2225,12 +2238,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2241,7 +2271,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2265,7 +2295,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2334,7 +2364,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2391,7 +2421,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2404,13 +2434,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2450,9 +2481,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
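+
+	/*
+	 * For an incremental sort node the presorted prefix is reported as a
+	 * separate property, e.g. (illustrative): "Sort Key: a, b" together with
+	 * "Presorted Key: a".
+	 */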
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2666,6 +2701,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
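+/*
+ * Print per-group statistics for one of the two sorts ("Full-sort" or
+ * "Presorted") performed by an incremental sort node.  In text format the
+ * output is shaped roughly like (illustrative):
+ *
+ *   Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 27kB (max)
+ */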
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", "XXX Groups", true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense since the same sort instrument can have been used
+			 * multiple times, so its most recent use still being in progress
+			 * doesn't seem relevant. Instead I'm now checking
+			 * to see if the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a
+		 * good idea to integrate this signaling with the parameter-change
+		 * mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..32ce05a63c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1201 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
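+/*
+ * Update the instrumentation for the group that was just sorted: bump the
+ * group count, fold in the tuplesort's space usage, remember which sort
+ * method was used, and copy the stats into shared memory when running as a
+ * parallel worker.
+ */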
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
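+	 *
+	 * For example (illustrative): for input sorted by (year, month) with
+	 * both columns presorted, month changes far more often than year in the
+	 * sorted stream, so comparing month first usually detects a new group
+	 * with a single comparison.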
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that the tuples we've already fetched all belong to a
+ * single prefix key group, we also have to handle the possibility that there
+ * is at least one different prefix key group before the large one.
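+ *
+ * For example (illustrative): if the full sort tuplesort holds
+ * (1, 5) (1, 2) (2, 9) (2, 1) ..., the tuples with prefix 1 form their own
+ * group that must be sorted and emitted before the tuples with prefix 2 can
+ * be moved into the presorted prefix tuplesort.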
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
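+		/*
+		 * We are about to read the last tuple left in the full sort
+		 * tuplesort when the count of tuples moved so far is one short of
+		 * n_fullsort_remaining.
+		 */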
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set up tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let the inner
+		 * node read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
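+
+/*
+ * Illustrative only: with a query like "SELECT ... ORDER BY a, b LIMIT 5"
+ * the bound is 5, so we begin checking for a prefix key group change after
+ * 5 tuples rather than after the usual 32.
+ */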
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
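+/* With the default minimum group size of 32, this cutoff is 64 tuples. */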
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will NULL the slot if no more tuples
+		 * remain. If the tuplesort is empty, but we don't have any more
+		 * tuples available for sort from the outer node, then outerNodeDone
+		 * will have been set so we'll return the empty slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * TODO: there isn't a good test case for the node->outerNodeDone
+			 * case directly, but lots of other stuff fails if it's not there.
+			 * If the outer node will fail when trying to fetch too many
+			 * tuples, then things break if this check isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the
+			 * caller, but if tuples remain in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded, and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
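+			 *
+			 * For example (illustrative): with a bound of 40, top-n heapsort
+			 * could not engage before 80 tuples, more than full sort mode
+			 * will accumulate for a single group, which is why we only set a
+			 * bound smaller than DEFAULT_MIN_GROUP_SIZE.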
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to the caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current batch of tuples in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slots for the group pivot and tuple transfer */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * XXX: This is suspect.
+	 *
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	node->outerNodeDone = false;
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Also must re-sort if the
+	 * bounded-sort parameters changed or we didn't select randomAccess.
+	 *
+	 * Otherwise we can just rewind and rescan the sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..f73d0782f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are all equal.  Incremental sort is sensitive to the distribution
+	 * of tuples across groups, and here we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and
+	 * inflate the assumed average group size by half, i.e. we cost each
+	 * group as if it held 1.5 * group_tuples tuples.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
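+/*
+ * Illustrative example (the numbers below are assumptions, not
+ * measurements): with input_tuples = 10000 and an estimated input_groups =
+ * 100, each group is costed as a tuplesort of 1.5 * 100 = 150 tuples.  The
+ * startup cost covers the input's startup cost, the share of input run
+ * cost attributable to the first group, and sorting that first group; the
+ * remaining 99 groups, the per-tuple group-detection overhead
+ * ((cpu_tuple_cost + comparison_cost) * 10000) and the per-group reset
+ * overhead (2.0 * cpu_tuple_cost * 100) all land in the run cost.
+ */
+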
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
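+/*
+ * For example (illustrative): with keys1 = (a, b, c) and keys2 = (a, b, d),
+ * *n_common is set to 2 and false is returned.  With keys1 = (a, b) and
+ * keys2 = (a, b, c), *n_common is set to 2 and true is returned, since
+ * keys1 is fully contained in keys2.
+ */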
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of pathkeys in common, or 0 if there are none.  Any
+	 * leading common pathkeys are useful for ordering because incremental
+	 * sort can exploit them.  For example, if query_pathkeys is (a, b, c)
+	 * and the path is sorted only by (a), that one common key still lets an
+	 * incremental sort reuse the existing order on "a".
+	 */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..55fe2a935c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4868,7 +4868,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4924,8 +4924,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The new paths we need to consider are an explicit full sort on the
+ * cheapest-total existing path, plus incremental sorts on any paths
+ * with presorted keys.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4964,29 +4964,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3e836e6e1c..11e6fce9d1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..4949ef2079 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
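+	/*
+	 * Illustrative usage: "SET enable_incrementalsort = off;" keeps the
+	 * planner (see create_ordered_paths) from generating incremental sort
+	 * paths.
+	 */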
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..583551d197 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * allocation overhead is kept low.  However, we don't consider array sizes
+ * less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
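+
+/*
+ * For example (assuming ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes and
+ * sizeof(SortTuple) is 24 bytes on a typical 64-bit build): 8192 / 24 + 1
+ * = 342 with integer division, so the Max() keeps the floor of 1024 slots
+ * here.
+ */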
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
+	 */
+
+	/*
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of
+	 * the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,70 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +955,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1050,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1128,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1171,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1288,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1358,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
+	 */
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit it in main memory.  This
+	 * is why we treat space used on disk as more important for tracking
+	 * resource usage than space used in memory.  Note that the amount of
+	 * space a set of tuples occupies on disk might be less than the amount
+	 * of space the same tuples occupy in memory, due to the more compact
+	 * on-disk representation.  For example, a batch that sorted within 4MB
+	 * of memory followed by a batch that spilled 2MB to disk is reported as
+	 * a 2MB disk sort.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2752,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2802,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3299,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..7f993c51de 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input dataset is
+ *	 already sorted on a prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,68 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index cb012ba198..34f18bd73a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..4f6f2288a3
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect.
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 25,       +
+             "Maximum Sort Space Used": 25,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect.
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#220Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#219)
8 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

I've looked at v38, but it seems it's a bit broken by some recent explain
changes (mostly missing types in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've kept them in separate "fix" patches
after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

I've done a fair amount of testing with various queries, and the plan
changes seem pretty sensible. I'm still not entirely sure whether to be
a bit conservative and commit only the basic incremental sort patch, or
to also commit the parts adding incremental sort to extra places.
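
As a rough illustration of the plan shape involved (a minimal sketch
along the lines of the regression tests in the patch; the table and
column names are just for the example):

    create table t (a int, b int);
    insert into t select i % 10, i from generate_series(1, 10000) i;
    create index on t (a);
    analyze t;
    -- under a LIMIT this can become Limit -> Incremental Sort with
    -- Presorted Key: t.a, instead of a full sort of the whole table
    explain (costs off) select * from t order by a, b limit 10;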

The main thing I still have on my plate is assessing how much more
expensive the planning can get due to the increased number of paths we
generate/keep (due to considering extra pathkeys). I haven't seen any
significant slowdowns, but I plan to look at some extreme cases (many
similar and applicable indexes, etc.).
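
For example, a setup along these lines (a hypothetical sketch, not a
benchmark I'm presenting results for) should stress the extra path
bookkeeping:

    create table t (a int, b int, c int, d int);
    insert into t select i % 10, i % 100, i % 1000, i
      from generate_series(1, 100000) i;
    -- several indexes sharing a prefix, so many partial orderings apply
    create index on t (a);
    create index on t (a, b);
    create index on t (a, c);
    create index on t (a, b, c);
    analyze t;
    \timing on
    explain (costs off) select * from t order by a, b, c, d limit 10;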

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v39-0001-Consider-low-startup-cost-when-adding-partial-path.patchtext/plain; charset=us-asciiDownload
From 6fd38a5d79b88202dee963d52629527e37693be6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH 1/8] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen, because a low startup cost partial path is discarded in favor of
a lower total cost partial path, and a limit applied on top of that
would normally favor the lower startup cost plan.
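
For illustration, a sketch of the kind of query affected (hypothetical
and simplified; it assumes a table t with an index on column a, and
parallelism enabled):

    set max_parallel_workers_per_gather = 2;
    explain select * from t order by a, b limit 10;

A partial path already ordered on (a) has a low startup cost once an
incremental sort is placed on top of it, but a higher total cost;
comparing partial paths on total cost alone throws it away, even though
the LIMIT makes it the better choice.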
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..702189fe17 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Unless pathkeys are incompatible, keep just one of the two paths. */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
-			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.21.1

v39-0002-fix-comments.patchtext/plain; charset=us-asciiDownload
From b7edae462892ebb9e87025b6801985d8e4b31824 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 16:26:37 +0100
Subject: [PATCH 2/8] fix comments

---
 src/backend/optimizer/util/pathnode.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 702189fe17..9c8f3b1f0b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering needed by upper parts of
+ *	  the plan, because that will affect how expensive the incremental
+ *	  sort is.  Because of that we need to consider both the startup and
+ *	  total cost, in addition to pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,7 +775,14 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (in both startup and total cost). It may happen that one
+		 * path has a lower startup cost and the other a lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
 			PathCostComparison costcmp;
-- 
2.21.1

v39-0003-Implement-incremental-sort.patchtext/plain; charset=us-asciiDownload
From b4c6c28fbf438a02d1d20f0d22b8df3bb51ce922 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH 3/8] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1201 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   73 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  296 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   77 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3852 insertions(+), 162 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..fe77f8eb4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..dd4600a214 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
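+/*
+ * Print statistics for a single group type (full-sort or presorted) of an
+ * incremental sort node.  In text format this emits a line of roughly the
+ * following shape (values are illustrative only):
+ *
+ *   Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
+ */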
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(lfirst_int(methodCell));
+			appendStringInfoString(es->str, sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfoString(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(lfirst_int(methodCell));
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense since the same sort instrument can have been used
+			 * multiple times, so the fact that its most recent use was still
+			 * in progress doesn't seem relevant. Instead I'm now checking
+			 * whether the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..d15a86a706 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use
+		 * bounded sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign
+		 * this, it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..32ce05a63c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1201 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Suppose the input tuples are the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm splits the input into the following
+ *		groups, which have equal X, and then sorts them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and concatenating them, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
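+ *
+ *		As an illustration (hypothetical table and index names, plan shape
+ *		abbreviated), given an index on only the prefix column:
+ *
+ *			CREATE INDEX tab_x_idx ON tab (x);
+ *			SELECT * FROM tab ORDER BY x, y LIMIT 10;
+ *
+ *		the planner may then choose a plan like:
+ *
+ *			Limit
+ *			  ->  Incremental Sort
+ *			        Sort Key: x, y
+ *			        Presorted Key: x
+ *			        ->  Index Scan using tab_x_idx on tab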
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
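+/*
+ * Record statistics (group count, sort method, and memory/disk space used)
+ * for a prefix key group we have just finished sorting, and, if we are a
+ * parallel worker, mirror the accumulated stats into shared memory.
+ */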
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change. Therefore we do our comparison
+	 * starting from the last pre-sorted column to optimize for early
+	 * detection of inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the tuples we've already fetched are part of a
+ * single prefix key group, we also have to handle the possibility that there
+ * is at least one different prefix key group before the large one.
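+ *
+ * As a sketch (showing prefix key values only): if the full sort tuplesort
+ * holds tuples 1, 1, 2, 2, ..., the loop below moves the run of 1s into the
+ * presorted prefix tuplesort, hits the first 2, sorts and returns that small
+ * group, and a later call (while n_fullsort_remaining is still positive)
+ * carries the 2s over.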
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort tuplesort, so we'll sort this batch, let the parent
+		 * node read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
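+
+/*
+ * As a worked example with the defaults above (32 and 64): we blindly
+ * accumulate the first 32 tuples into the full sort tuplesort, remember the
+ * 32nd tuple as the group pivot, and then compare each subsequent tuple's
+ * prefix keys against that pivot.  If we pass 64 tuples without detecting a
+ * prefix key change, we assume we're inside a large prefix key group and
+ * switch to presorted prefix mode.
+ */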
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs incremental sort. The
+ *		implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
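+ *
+ *		A rough sketch of the execution status transitions:
+ *		  INCSORT_LOADFULLSORT   -> INCSORT_READFULLSORT
+ *		      (prefix key change or end of input)
+ *		  INCSORT_LOADFULLSORT   -> INCSORT_LOADPREFIXSORT/READPREFIXSORT
+ *		      (large prefix key group detected)
+ *		  INCSORT_LOADPREFIXSORT -> INCSORT_READPREFIXSORT
+ *		      (prefix key change or end of input)
+ *		  INCSORT_READ*SORT      -> INCSORT_LOADFULLSORT
+ *		      (sorted group fully drained)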
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will NULL the slot if no more tuples
+		 * remain. If the tuplesort is empty, but we don't have any more
+		 * tuples available to sort from the outer node, then outerNodeDone
+		 * will have been set so we'll return the empty slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * TODO: there isn't a good test case for the node->outerNodeDone
+			 * case directly, but lots of other stuff fails if it's not there.
+			 * If the outer node will fail when trying to fetch too many
+			 * tuples, then things break if this check isn't here.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if tuples remain in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point, we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If the sort is bounded, calculate the number of tuples remaining
+		 * and configure both bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
+			 * then we make the assumption that it's likely that we've found a
+			 * large group of tuples having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" would have sorted "below"
+				 * the retained ones, and we're already contractually
+				 * guaranteed not to need any more than currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and begin returning the tuples to the parent node. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current prefix key group's tuples in the tuplesort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop the standalone tuple slots made for outer node tuples */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * XXX: This is suspect.
+	 *
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	node->outerNodeDone = false;
+
+	/*
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Since incremental sort never
+	 * requests a randomAccess tuplesort, we can't simply rewind and rescan
+	 * the sorted output, so we must always re-sort.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..8efbb660b9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ *	  Determines and returns the cost of sorting a relation, including the
+ *	  cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *	  Determines and returns the cost of sorting a relation incrementally,
+ *	  when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are all equal.  Incremental sort is sensitive to the distribution
+	 * of tuples across groups, and we're relying on quite rough assumptions
+	 * here.  Thus, we're pessimistic about incremental sort performance and
+	 * inflate the average group size by half (i.e. cost for 1.5x the tuples).
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
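+
+/*
+ * To illustrate the cost model above with made-up numbers: if the input has
+ * 1,000,000 tuples split into 1,000 groups by the presorted keys, then
+ * group_tuples = 1,000 and each per-group tuplesort is costed as if it
+ * sorted 1,500 tuples (the 1.5x pessimism factor above).  The startup cost
+ * covers reading and sorting only the first group, which is what lets a
+ * LIMIT stop early; the run cost covers the remaining 999 group sorts plus
+ * the per-tuple group-detection overhead and the per-group tuplesort resets.
+ */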
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
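+
+/*
+ * For example (illustration only): with keys1 = (a, b, c) and keys2 = (a, b),
+ * *n_common is set to 2 and false is returned, since keys1 is not fully
+ * contained in keys2.  With keys1 = (a, b) and keys2 = (a, b, c), *n_common
+ * is again 2, but true is returned.
+ */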
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of path keys in common, or 0 if there are none. Any
+	 * leading common pathkeys could be useful for ordering because we can
+	 * use incremental sort on them.
+	 */
+	return n_common_pathkeys;
 }
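+
+/*
+ * For instance, if the query requests ORDER BY a, b and a path is sorted by
+ * (a) alone, this now returns 1 rather than 0: the leading key can be reused
+ * by an incremental sort on (a, b).
+ */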
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index eb25c2f470..b2f4aaadb5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4869,7 +4869,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+														0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4925,8 +4925,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort or an
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4965,29 +4965,60 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9c8f3b1f0b..e15bc1baaa 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..583551d197 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * overhead of allocation is kept low.  However, we don't consider array
+ * sizes smaller than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
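+
+/*
+ * For instance (rough numbers only): on a typical 64-bit build where
+ * ALLOCSET_SEPARATE_THRESHOLD is 8192 and sizeof(SortTuple) is around 24
+ * bytes, the second expression comes out to roughly 342, so the Max() leaves
+ * us with 1024 entries (~24kB), comfortably above the threshold.
+ */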
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among the
+								 * sorts of all groups, either in-memory or
+								 * on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is the value for
+								 * on-disk space, false when it's the value
+								 * for in-memory space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
+	 */
+
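+	/*
+	 * The resulting context hierarchy is thus (tuplecontext is created
+	 * later, in tuplesort_begin_batch):
+	 *
+	 *	maincontext			-- survives tuplesort_reset()
+	 *	  sortcontext		-- reset between batches
+	 *		tuplecontext	-- caller tuples, recreated per batch
+	 */
+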
+	/*
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of the
+	 * state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,70 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +879,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +955,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1050,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1128,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1171,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1288,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1358,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
+	 */
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit that data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, thanks to
+	 * the more compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
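+
+/*
+ * For example (made-up numbers): if the first batch sorts in 4MB of memory
+ * and a later batch spills to disk using 2MB, the disk figure wins and we
+ * report 2MB of disk space, since spilling indicates the memory budget was
+ * exceeded.
+ */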
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
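+
+/*
+ * A sketch of the intended call pattern (illustration only, using the
+ * existing tuplesort API):
+ *
+ *		load one batch of tuples via tuplesort_puttupleslot();
+ *		tuplesort_performsort(state);
+ *		read the sorted tuples back out;
+ *		tuplesort_reset(state);		-- ready for the next batch
+ *		...
+ *		tuplesort_end(state);		-- after the final batch
+ *
+ * rather than paying for a full begin/end cycle on every batch.
+ */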
 
 /*
@@ -2591,8 +2752,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2802,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3299,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..e6a9b67675 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input dataset is
+ *	 already sorted on a prefix of those keys.  We call these "presorted
+ *	 keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
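+
+/*
+ * For example, when sorting by (a, b, c) over input already sorted by
+ * (a, b), there are two presorted keys, and the executor keeps one
+ * PresortedKeyData entry for each of a and b in order to detect group
+ * boundaries.
+ */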
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,68 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..fe4046b64b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,17 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..4f6f2288a3
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect.
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 25,       +
+             "Maximum Sort Space Used": 25,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would
+-- somehow force the type of plan we expect.
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.1

v39-0004-fix.patch (text/plain; charset=us-ascii)
From d0ee7837d2e8137390f6e64188ecbd4a1addb11d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 16:55:28 +0100
Subject: [PATCH 4/8] fix

---
 src/backend/commands/explain.c       |  6 ++--
 src/backend/executor/execProcnode.c  | 11 ++++----
 src/backend/optimizer/plan/planner.c | 41 +++++++++++++++++-----------
 src/backend/utils/sort/tuplesort.c   | 10 +++++--
 src/include/nodes/plannodes.h        |  1 -
 5 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dd4600a214..0256dd42f1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2767,7 +2767,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 
 		foreach(methodCell, groupInfo->sortMethods)
 		{
-			const	   *sortMethodName = tuplesort_method_name(methodCell->int_value);
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
 
 			methodNames = lappend(methodNames, sortMethodName);
 		}
@@ -2776,7 +2776,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
-			const	   *spaceTypeName;
+			const char *spaceTypeName;
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
@@ -2787,7 +2787,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
-			const	   *spaceTypeName;
+			const char *spaceTypeName;
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d15a86a706..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -852,12 +852,13 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 	else if (IsA(child_node, IncrementalSortState))
 	{
 		/*
-		 * If it is a Sort node, notify it that it can use bounded sort.
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
 		 *
-		 * Note: it is the responsibility of nodeSort.c to react properly to
-		 * changes of these parameters.  If we ever redesign this, it'd be a
-		 * good idea to integrate this signaling with the parameter-change
-		 * mechanism.
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
 		 */
 		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2f4aaadb5..0e01cf8cb1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4869,7 +4869,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-														0, cheapest_input_path->pathtarget->width, 0);
+			0, cheapest_input_path->pathtarget->width, 0);
 
 		/* Allow hashing only if hashtable is predicted to fit in work_mem */
 		allow_hash = (hashentrysize * numDistinctRows <= work_mem * 1024L);
@@ -4932,6 +4932,9 @@ create_distinct_paths(PlannerInfo *root,
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -5002,23 +5005,29 @@ create_ordered_paths(PlannerInfo *root,
 
 				add_path(ordered_rel, sorted_path);
 			}
-			if (enable_incrementalsort && presorted_keys > 0)
-			{
-				/* Also consider incremental sort. */
-				sorted_path = (Path *) create_incremental_sort_path(root,
-																	ordered_rel,
-																	input_path,
-																	root->sort_pathkeys,
-																	presorted_keys,
-																	limit_tuples);
 
-				/* Add projection step if needed */
-				if (sorted_path->pathtarget != target)
-					sorted_path = apply_projection_to_path(root, ordered_rel,
-														   sorted_path, target);
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
 
-				add_path(ordered_rel, sorted_path);
-			}
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
+			/* Add projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 583551d197..77c15ebd78 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -256,8 +256,8 @@ struct Tuplesortstate
 	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
 								 * space, false when it's value for in-memory
 								 * space */
-	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
-	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
 								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
@@ -803,6 +803,9 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ * XXX Missing comment.
+ */
 static void
 tuplesort_begin_batch(Tuplesortstate *state)
 {
@@ -1288,6 +1291,9 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+/*
+ * XXX Missing comment.
+ */
 bool
 tuplesort_used_bound(Tuplesortstate *state)
 {
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index fe4046b64b..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,7 +774,6 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
-
 /* ----------------
  *		incremental sort node
  * ----------------
-- 
2.21.1

v39-0005-Consider-incremental-sort-paths-in-additional-places.patch (text/plain; charset=us-ascii)
From 1baba95ea4d800de7d9ffcd24f04cec279cde518 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH 5/8] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 221 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 ++++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 350 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..9a92948fe3 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,223 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Pushing the query_pathkeys to the remote server is always worth
+	 * considering, because it might let us avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * The planner and executor don't have any clever strategy for
+			 * taking data sorted by a prefix of the query's pathkeys and
+			 * getting it to be sorted by all of those pathkeys. We'll just
+			 * end up re-sorting the entire data set.  So, unless we can push
+			 * down all of the query pathkeys, forget it.
+			 *
+			 * is_foreign_expr would detect volatile expressions as well, but
+			 * checking ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		/*
+		 * This ends up allowing us to do incremental sort on top of an index
+		 * scan, all parallelized under a gather merge node.
+		 */
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike generate_gather_paths, this does not look only at pathkeys of the
+ * input paths (aiming to preserve the ordering). It also considers ordering
+ * that might be useful by nodes above the gather merge node, and tries to
+ * add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather merge paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * consider regular sort for cheapest partial path (for each
+			 * useful pathkeys)
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/* Also consider incremental sort */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3116,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0e01cf8cb1..46dc355af3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6434,7 +6434,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6493,6 +6495,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6819,7 +6895,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6854,6 +6932,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7235,7 +7363,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.21.1

v39-0006-fix.patch (text/plain; charset=us-ascii)
From 5752bf071168a1dfd16fa3ae045182ccd3e5ee81 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 18:26:16 +0100
Subject: [PATCH 6/8] fix

---
 src/backend/optimizer/path/allpaths.c | 62 +++++++++++++++++----------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 9a92948fe3..6838a238cd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2764,6 +2764,14 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
  * order matches the final output ordering for the overall query we're
  * planning, or because it enables an efficient merge join.  Here, we try
  * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
  */
 static List *
 get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
@@ -2772,8 +2780,8 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	ListCell   *lc;
 
 	/*
-	 * Pushing the query_pathkeys to the remote server is always worth
-	 * considering, because it might let us avoid a local sort.
+	 * Considering query_pathkeys is always worth it, because it might let us
+	 * avoid a local sort.
 	 */
 	if (root->query_pathkeys)
 	{
@@ -2786,14 +2794,9 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			Expr	   *em_expr;
 
 			/*
-			 * The planner and executor don't have any clever strategy for
-			 * taking data sorted by a prefix of the query's pathkeys and
-			 * getting it to be sorted by all of those pathkeys. We'll just
-			 * end up re-sorting the entire data set.  So, unless we can push
-			 * down all of the query pathkeys, forget it.
-			 *
-			 * is_foreign_expr would detect volatile expressions as well, but
-			 * checking ec_has_volatile here saves some cycles.
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
 			 */
 			if (pathkey_ec->ec_has_volatile ||
 				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
@@ -2803,10 +2806,6 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			}
 		}
 
-		/*
-		 * This ends up allowing us to do incremental sort on top of an index
-		 * scan all parallelized under a gather merge node.
-		 */
 		if (query_pathkeys_ok)
 			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
 	}
@@ -2819,10 +2818,10 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
  *		Generate parallel access paths for a relation by pushing a Gather or
  *		Gather Merge on top of a partial path.
  *
- * Unlike generate_gather_paths, this does not look only at pathkeys of the
- * input paths (aiming to preserve the ordering). It also considers ordering
- * that might be useful by nodes above the gather merge node, and tries to
- * add a sort (regular or incremental) to provide that.
+ * Unlike plain generate_gather_paths, this looks not only at the pathkeys
+ * of the input paths (aiming to preserve the ordering), but also considers
+ * orderings that might be useful for nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide that.
  */
 void
 generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
@@ -2841,7 +2840,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	if (override_rows)
 		rowsp = &rows;
 
-	/* generate the regular gather merge paths */
+	/* generate the regular gather (merge) paths */
 	generate_gather_paths(root, rel, override_rows);
 
 	/* when incremental sort is disabled, we're done */
@@ -2851,7 +2850,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	/* consider incremental sort for interesting orderings */
 	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
 
-	/* used for explicit sort paths */
+	/* used for explicit (full) sort paths */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
 
 	/*
@@ -2880,6 +2879,14 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 													 subpath->pathkeys,
 													 &presorted_keys);
 
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
 			if (is_sorted)
 			{
 				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
@@ -2892,8 +2899,14 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			Assert(!is_sorted);
 
 			/*
-			 * consider regular sort for cheapest partial path (for each
-			 * useful pathkeys)
+			 * Consider regular sort for the cheapest partial path (for each
+			 * set of useful pathkeys). We know the path is not sorted, because
+			 * we'd not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
 			 */
 			if (cheapest_partial_path == subpath)
 			{
@@ -2919,7 +2932,10 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 				/* Fall through */
 			}
 
-			/* Also consider incremental sort */
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
 			if (presorted_keys > 0)
 			{
 				Path	   *tmp;
-- 
2.21.1

v39-0007-A-couple-more-places-for-incremental-sort.patch (text/plain; charset=us-ascii)
From 2e6cbe5bc169ebe1e9971b5256003d02349ad1ce Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH 7/8] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 220 ++++++++++++++++++++++++-
 2 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 46dc355af3..2880fcabe8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5080,6 +5080,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably a duplicate of the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we
+				 * can't simply skip it, because it may be partially sorted in
+				 * which case we want to consider incremental sort on top of
+				 * it (instead of full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6580,12 +6641,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6616,6 +6683,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6887,6 +7004,60 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Also consider incremental sort on all partially sorted paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6951,10 +7122,10 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
+			/* Since we have presorted keys, consider incremental sort. */
 			path = (Path *) create_incremental_sort_path(root,
 														 partially_grouped_rel,
 														 path,
@@ -7079,10 +7250,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7108,6 +7280,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7209,7 +7421,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.21.1

v39-0008-fix.patch (text/plain; charset=us-ascii)
From 59e9a858763a274405b5015344526092c6e55af5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 18:48:19 +0100
Subject: [PATCH 8/8] fix

---
 src/backend/optimizer/plan/planner.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2880fcabe8..b4763e79f8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5100,17 +5100,17 @@ create_ordered_paths(PlannerInfo *root,
 				double		total_groups;
 
 				/*
-				 * We don't care if this is the cheapest partial path - we
-				 * can't simply skip it, because it may be partially sorted in
-				 * which case we want to consider incremental sort on top of
-				 * it (instead of full sort, which is what happens above).
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
 				 */
 
 				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
 														 input_path->pathkeys,
 														 &presorted_keys);
 
-				/* Ignore already sorted paths */
+				/* No point in adding incremental sort on fully sorted paths. */
 				if (is_sorted)
 					continue;
 
@@ -7005,9 +7005,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 			}
 		}
 
-		/*
-		 * Also consider incremental sort on all partially sorted paths.
-		 */
+		/* Consider incremental sort on all partial paths, if enabled. */
 		if (enable_incrementalsort)
 		{
 			foreach(lc, input_rel->pathlist)
@@ -7122,10 +7120,10 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
-			/* Since we have presorted keys, consider incremental sort. */
 			path = (Path *) create_incremental_sort_path(root,
 														 partially_grouped_rel,
 														 path,
-- 
2.21.1

#221Andreas Karlsson
andreas@proxel.se
In reply to: Tomas Vondra (#220)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 3/21/20 1:56 AM, Tomas Vondra wrote:

I've looked at v38 but it seems it's a bit broken by some recent explain
changes (mostly missing type in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've instead kept them in separate "fix"
patches after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

Hi,

I haven't had any time to look at the patch yet but when compiling it
and running the tests GCC gave me a warning. The tests passed btw.

gcc (Debian 8.3.0-6) 8.3.0

explain.c: In function ‘show_incremental_sort_group_info’:
explain.c:2772:39: warning: passing argument 2 of ‘lappend’ discards
‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
In file included from ../../../src/include/access/xact.h:20,
from explain.c:16:
../../../src/include/nodes/pg_list.h:509:14: note: expected ‘void *’ but
argument is of type �const char *�
extern List *lappend(List *list, void *datum);
^~~~~~~
explain.c:2772:39: warning: passing 'const char *' to parameter of type
'void *' discards qualifiers
[-Wincompatible-pointer-types-discards-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
../../../src/include/nodes/pg_list.h:509:40: note: passing argument to
parameter 'datum' here
extern List *lappend(List *list, void *datum);
^
Andreas

#222James Coleman
jtc331@gmail.com
In reply to: Andreas Karlsson (#221)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 22, 2020 at 6:02 PM Andreas Karlsson <andreas@proxel.se> wrote:

On 3/21/20 1:56 AM, Tomas Vondra wrote:

I've looked at v38 but it seems it's a bit broken by some recent explain
changes (mostly missing type in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've instead kept them in separate "fix"
patches after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

Hi,

I haven't had any time to look at the patch yet but when compiling it
and running the tests GCC gave me a warning. The tests passed btw.

gcc (Debian 8.3.0-6) 8.3.0

explain.c: In function ‘show_incremental_sort_group_info’:
explain.c:2772:39: warning: passing argument 2 of ‘lappend’ discards
‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
In file included from ../../../src/include/access/xact.h:20,
from explain.c:16:
../../../src/include/nodes/pg_list.h:509:14: note: expected ‘void *’ but
argument is of type ‘const char *’
extern List *lappend(List *list, void *datum);
^~~~~~~
explain.c:2772:39: warning: passing 'const char *' to parameter of type
'void *' discards qualifiers
[-Wincompatible-pointer-types-discards-qualifiers]
methodNames = lappend(methodNames, sortMethodName);
^~~~~~~~~~~~~~
../../../src/include/nodes/pg_list.h:509:40: note: passing argument to
parameter 'datum' here
extern List *lappend(List *list, void *datum);

So if we naively get rid of the const on the variable declaration in
question, then we get this warning instead:

explain.c: In function ‘show_incremental_sort_group_info’:
explain.c:2770:27: warning: initialization discards ‘const’ qualifier
from pointer target type [-Wdiscarded-qualifiers]
char *sortMethodName = tuplesort_method_name(methodCell->int_value);

So on the face of it we have a bit of a no-win situation: the function
tuplesort_method_name returns a const pointer, but lappend wants a
non-const one. I'm not sure what the project style preference is here:
we could cast the result to (char *) to drop the const qualifier, but
that's frowned upon in some places. Alternatively we could make a new
non-const copy of the string. Which is preferable in the postgres
project style?

James

#223Andreas Karlsson
andreas@proxel.se
In reply to: James Coleman (#222)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 3/23/20 1:33 AM, James Coleman wrote:

So on the face of it we have a bit of a no-win situation. The function
tuple_sort_method_name returns a const, but lappend wants a non-const.
I'm not sure what the project style preference is here: we could cast
the result as (char *) to drop the const qualifier, but that's frowned
upon some places. Alternatively we could make a new non-const copy of
string. Which is preferable in the postgres project style?

PostgreSQL has places where const is explicitly cast away with the
unconstify() macro, so unless you can find a better solution that is
probably an OK option.
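
E.g. the call in show_incremental_sort_group_info could presumably just
become something like:

    methodNames = lappend(methodNames,
                          unconstify(char *, sortMethodName));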

Andreas

#224James Coleman
jtc331@gmail.com
In reply to: Andreas Karlsson (#223)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 22, 2020 at 8:54 PM Andreas Karlsson <andreas@proxel.se> wrote:

On 3/23/20 1:33 AM, James Coleman wrote:

So on the face of it we have a bit of a no-win situation: the function
tuplesort_method_name returns a const pointer, but lappend wants a
non-const one. I'm not sure what the project style preference is here:
we could cast the result to (char *) to drop the const qualifier, but
that's frowned upon in some places. Alternatively we could make a new
non-const copy of the string. Which is preferable in the postgres
project style?

PostgreSQL has places where const is explicitly cast away with the
unconstify() macro, so unless you can find a better solution that is
probably an OK option.

Thanks, that's exactly what I need!

James

#225James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#220)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 20, 2020 at 8:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I've looked at v38 but it seems it's a bit broken by some recent explain
changes (mostly missing type in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've instead kept them in separate "fix"
patches after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

I've done a fair amount of testing with various queries, and the plan
changes seem pretty sensible. I'm still not entirely sure whether to be
a bit conservative and only tweak the first patch adding incremental
sort to extra places, or commit both.

The main thing I still have on my plate is assessing how much more
expensive the planning can be due to the increased number of paths we
generate/keep (due to considering extra pathkeys). I haven't seen any
significant slowdowns, but I plan to look at some extreme cases (many
similar and applicable indexes etc.).

I'm currently incorporating all of the fixes you proposed into the
main patch series, as well as doing a thorough read-through of the
current state of the patch. I'm hoping to reply tomorrow with:

- Fix patches of my own to clean up and add additional comments.
- A catalog of all of the current open questions (XXX, etc.) in the
patch to more easily discuss them on the mailing list.

One question I have while I work on that: I've noticed some confusion
in the patch as to whether we should refer to the node below the
incremental sort node in the plan tree (i.e., the node we get tuples
from) as the inner node or the outer node. Intuitively I'd expect to
call it the inner node, but the original patch referred to it
frequently as the outer node. The outerPlanState/innerPlanState macro
comments don't offer a lot of clarification, though they claim to exist
"to avoid confusion" about right/left versus inner/outer. I suppose if
the outerPlanState macro works here, the correct term should be outer?

Thanks,
James

#226Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#224)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 22, 2020 at 10:05:50PM -0400, James Coleman wrote:

On Sun, Mar 22, 2020 at 8:54 PM Andreas Karlsson <andreas@proxel.se> wrote:

On 3/23/20 1:33 AM, James Coleman wrote:

So on the face of it we have a bit of a no-win situation: the function
tuplesort_method_name returns a const pointer, but lappend wants a
non-const one. I'm not sure what the project style preference is here:
we could cast the result to (char *) to drop the const qualifier, but
that's frowned upon in some places. Alternatively we could make a new
non-const copy of the string. Which is preferable in the postgres
project style?

PostgreSQL has places where const is explicitly cast away with the
unconstify() macro, so unless you can find a better solution that is
probably an OK option.

Thanks, that's exactly what I need!

Yeah, sorry I forgot to mention/fix the warning in the last review round.

BTW I think the comment for pathkeys_useful_for_ordering() needs
updating; it still claims it's an all-or-nothing affair, but that's no
longer true, I think.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#227Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: James Coleman (#225)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-22, James Coleman wrote:

One question I have while I work on that: I've noticed some confusion
in the patch as to whether we should refer to the node below the
incremental sort node in the plan tree (i.e., the node we get tuples
from) as the inner node or the outer node. Intuitively I'd expect to
call it the inner node, but the original patch referred to it
frequently as the outer node. The outerPlanState/innerPlanState macro
comments don't offer a lot of clarification, though they claim to exist
"to avoid confusion" about right/left versus inner/outer. I suppose if
the outerPlanState macro works here, the correct term should be outer?

I think the inner/outer distinction comes from join nodes wanting to
distinguish which child drives the scan of the other. If there's a
single child, there's no need to make such a distinction: it's just "the
child". And if it's the only child, conventionally we use the first
one, which (for us westerners) is the one on the left.
This view is supported by the fact that outerPlanState() appears 113
times in the code whereas innerPlanState() appears only 27 times --
that is, all plan types that use only one child use the outer one. They
could use either, as long as each does so consistently, I think.

Therefore the term should be "outer". It's not "outer" to the parent
incremental sort; it's just the "outer" of its two possible children.

I think.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#228Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#227)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

... all plan types that use only one child use the outer one. They
could use either, as long as each does so consistently, I think.

Yeah, exactly. The outer/inner terminology is really only sensible
for join nodes, but there isn't a third child-plan pointer reserved
for single-child node types, so you gotta use one of those. And
conventionally we use the "outer" one.

regards, tom lane

#229James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#228)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 23, 2020 at 1:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

... all plan types that use only one child use the outer one. They
could use either, as long as each does so consistently, I think.

Yeah, exactly. The outer/inner terminology is really only sensible
for join nodes, but there isn't a third child-plan pointer reserved
for single-child node types, so you gotta use one of those. And
conventionally we use the "outer" one.

regards, tom lane

Great, thanks for the explanation Alvaro and Tom; I'll fix that up in
my next patch series.

I idly wonder if a macro childPlanState() defined exactly the same as
outerPlanState() might _kinda_ make sense here, but I'm also content
to follow convention.
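
For reference, the existing macros in src/include/nodes/execnodes.h are
just:

    #define innerPlanState(node)  (((PlanState *)(node))->righttree)
    #define outerPlanState(node)  (((PlanState *)(node))->lefttree)

so the alias I have in mind would be a one-liner along these lines
(purely hypothetical, to be clear):

    #define childPlanState(node)  outerPlanState(node)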

I might submit a small patch to the comment on those macros though to
expand on the explanation.

James

#230James Coleman
jtc331@gmail.com
In reply to: James Coleman (#225)
7 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 22, 2020 at 10:17 PM James Coleman <jtc331@gmail.com> wrote:

On Fri, Mar 20, 2020 at 8:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

I've looked at v38 but it seems it's a bit broken by some recent explain
changes (mostly missing type in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've instead kept them in separate "fix"
patches after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

I've done a fair amount of testing with various queries, and the plan
changes seem pretty sensible. I'm still not entirely sure whether to be
a bit conservative and only tweak the first patch adding incremental
sort to extra places, or commit both.

The main thing I still have on my plate is assessing how much more
expensive the planning can be due to the increased number of paths we
generate/keep (due to considering extra pathkeys). I haven't seen any
significant slowdowns, but I plan to look at some extreme cases (many
similar and applicable indexes etc.).

I'm currently incorporating all of the fixes you proposed into the
main patch series, as well as doing a thorough read-through of the
current state of the patch. I'm hoping to reply tomorrow with:

- Fix patches of my own to clean up and add additional comments.
- A catalog of all of the current open questions (XXX, etc.) in the
patch to more easily discuss them on the mailing list.

Here's the above.

Current TODOs:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that? (See the sketch after this list.)

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve it by inserting extra rows or by adding a GUC that would somehow
-- force the type of plan we expect.

Maybe this isn't an actual issue (I have a vague memory of auto-analyze
being disabled during these regression tests)?

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

This might be something we just leave in as a comment?

9. In the same function as the above:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That seems like a copy from the fdw code; I didn't remove it in the
attached patchset, because I think I have a diff on a branch somewhere
that standardizes some of this shared code between here and the
postgres_fdw code.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely a duplicate of
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably a duplicate of the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
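
Regarding item 3 above: if we do decide the node should handle it
itself, I assume it's just a matter of adding a CHECK_FOR_INTERRUPTS()
(from miscadmin.h) at the top of each tuple-pulling loop. A rough,
untested sketch of the shape it would take in the load loop:

    for (;;)
    {
        /* Allow query cancel/termination while pulling tuples. */
        CHECK_FOR_INTERRUPTS();

        slot = ExecProcNode(outerNode);

        if (TupIsNull(slot))
        {
            /* Remember that the outer node is exhausted. */
            node->outerNodeDone = true;
            break;
        }

        /* Accumulate the tuple into the current sort batch as before. */
        tuplesort_puttupleslot(fullsort_state, slot);
        nTuples++;
    }

But whether that's actually required here is exactly the open question.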

Attached is v40, which primarily includes quite a bit of comment cleanup.

James

Attachments:

v40-0001-Consider-low-startup-cost-when-adding-partial-pa.patch (text/x-patch; charset=US-ASCII)
From 70ca803d6a5c0dd0dc76d74e08cf9c6172d8089e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v40 1/7] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, and a limit applied on top of that would
normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..9c8f3b1f0b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering needed by upper parts of the
+ *	  plan, because that will affect how expensive the incremental sort is.
+ *	  Because of that we need to consider both startup and total cost,
+ *	  in addition to pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has a lower startup cost and the other a lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v40-0004-fix-lots-of-comments.patch (text/x-patch; charset=US-ASCII)
From 6e247a3f2c2479db8d959fd0b90c0dfa94817454 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 23 Mar 2020 21:59:32 -0400
Subject: [PATCH v40 4/7] fix lots of comments

---
 src/backend/commands/explain.c             |  24 +-
 src/backend/executor/nodeIncrementalSort.c | 275 +++++++++++++--------
 src/backend/optimizer/path/costsize.c      |   2 +-
 src/backend/optimizer/path/pathkeys.c      |  12 +-
 src/backend/optimizer/plan/planner.c       |   4 +-
 src/backend/utils/sort/tuplesort.c         |  11 +-
 src/include/nodes/execnodes.h              |   4 +
 7 files changed, 201 insertions(+), 131 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0256dd42f1..cf8cfd31f5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2703,7 +2703,14 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
-
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
 static void
 show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 								 const char *groupLabel, ExplainState *es)
@@ -2769,7 +2776,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		{
 			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
 
-			methodNames = lappend(methodNames, sortMethodName);
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
 		}
 		ExplainPropertyList("Sort Methods Used", methodNames, es);
 
@@ -2831,16 +2838,9 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			&incrsortstate->shared_info->sinfo[n];
 
 			/*
-			 * XXX: The previous version of the patch chcked:
-			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
-			 * and continued if the condition was true (with the comment
-			 * "ignore any unfilled slots"). I'm not convinced that makes
-			 * sense since the same sort instrument can have been used
-			 * multiple times, so the last time it being used being still in
-			 * progress, doesn't seem to be relevant. Instead I'm now checking
-			 * to see if the group count for each group info is 0. If both are
-			 * 0, then we exclude the worker since it didn't contribute
-			 * anything meaningful.
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
 			 */
 			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
 			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 32ce05a63c..296a0c0675 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -68,6 +68,14 @@
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
 
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
 static void
 instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 					  Tuplesortstate *sortState)
@@ -78,6 +86,8 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	groupInfo->groupCount++;
 
 	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
 	switch (sort_instr.spaceType)
 	{
 		case SORT_SPACE_TYPE_DISK:
@@ -94,6 +104,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 			break;
 	}
 
+	/* Track each sort method we've used. */
 	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
 		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
 											 sort_instr.sortMethod);
@@ -109,8 +120,11 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	}
 }
 
-/*
- * Prepare information for presorted_keys comparison.
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
  */
 static void
 preparePresortedCols(IncrementalSortState *node)
@@ -153,11 +167,12 @@ preparePresortedCols(IncrementalSortState *node)
 	}
 }
 
-/*
- * Check whether a given tuple belongs to the current sort group.
+/* ----------------------------------------------------------------
+ * isCurrentGroup
  *
- * We do this by comparing its first 'presortedCols' column values to
- * the pivot tuple of the current group.
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
  */
 static bool
 isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
@@ -214,8 +229,8 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
 	return true;
 }
 
-/*
- * Switch to presorted prefix mode.
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
  *
  * When we determine that we've likely encountered a large batch of tuples all
  * having the same presorted prefix values, we want to optimize tuplesort by
@@ -224,13 +239,14 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
  * The problem is that we've already accumulated several tuples in another
  * tuplesort configured to sort by all columns (assuming that there may be
  * more than one prefix key group). So to switch to presorted prefix mode we
- * have to go back an look at all the tuples we've already accumulated and
+ * have to go back and look at all the tuples we've already accumulated to
  * verify they're all part of the same prefix key group before sorting them
  * solely by unsorted suffix keys.
  *
  * While it's likely that all already fetch tuples are all part of a single
  * prefix group, we also have to handle the possibility that there is at least
  * one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
  */
 static void
 switchToPresortedPrefixMode(PlanState *pstate)
@@ -248,6 +264,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	outerNode = outerPlanState(node);
 	tupDesc = ExecGetResultType(outerNode);
 
+	/* Configure the prefix sort state the first time around. */
 	if (node->prefixsort_state == NULL)
 	{
 		Tuplesortstate *prefixsort_state;
@@ -287,6 +304,10 @@ switchToPresortedPrefixMode(PlanState *pstate)
 							node->bound - node->bound_Done);
 	}
 
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
 	for (;;)
 	{
 		lastTuple = node->n_fullsort_remaining - nTuples == 1;
@@ -326,7 +347,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 			{
 				/*
 				 * The tuple isn't part of the current batch so we need to
-				 * carry it over into the next set up tuples we transfer out
+				 * carry it over into the next batch of tuples we transfer out
 				 * of the full sort tuplesort into the presorted prefix
 				 * tuplesort. We don't actually have to do anything special to
 				 * save the tuple since we've already loaded it into the
@@ -344,12 +365,14 @@ switchToPresortedPrefixMode(PlanState *pstate)
 
 		firstTuple = false;
 
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next tuple
+		 * from the outer plan node, so we retain the current group pivot tuple
+		 * for the next prefix key group comparison.
+		 */
 		if (lastTuple)
-
-			/*
-			 * We retain the current group pivot tuple since we haven't yet
-			 * found the end of the current prefix key group.
-			 */
 			break;
 	}
 
@@ -387,9 +410,9 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	{
 		/*
 		 * We finished a group but didn't consume all of the tuples from the
-		 * full sort batch sorter, so we'll sort this batch, let the inner
-		 * node read out all of those tuples, and then come back around to
-		 * find another batch.
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
 		 */
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
@@ -402,7 +425,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		if (node->bounded)
 		{
 			/*
-			 * If the current node has a bound, and we've already sorted n
+			 * If the current node has a bound and we've already sorted n
 			 * tuples, then the functional bound remaining is (original bound
 			 * - n), so store the current number of processed tuples for use
 			 * in configuring sorting bound.
@@ -490,6 +513,11 @@ ExecIncrementalSort(PlanState *pstate)
 	dir = estate->es_direction;
 	fullsort_state = node->fullsort_state;
 
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to see
+	 * if there are any remaining tuples in that batch that we can return before
+	 * moving on to other execution states.
+	 */
 	if (node->execution_status == INCSORT_READFULLSORT
 		|| node->execution_status == INCSORT_READPREFIXSORT)
 	{
@@ -502,19 +530,18 @@ ExecIncrementalSort(PlanState *pstate)
 
 		/*
 		 * We have to populate the slot from the tuplesort before checking
-		 * outerNodeDone because it will NULL the slot if no more tuples
+		 * outerNodeDone because it will set the slot to NULL if no more tuples
 		 * remain. If the tuplesort is empty, but we don't have any more
-		 * tuples avaialable for sort from the outer node, then outerNodeDone
-		 * will have been set so we'll return the empty slot to the caller.
+		 * tuples available for sort from the outer node, then outerNodeDone
+		 * will have been set so we'll return that now-empty slot to the caller.
 		 */
 		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
 								   false, slot, NULL) || node->outerNodeDone)
 
 			/*
-			 * TODO: there isn't a good test case for the node->outerNodeDone
-			 * case directly, but lots of other stuff fails if it's not there.
-			 * If the outer node will fail when trying to fetch too many
-			 * tuples, then things break if this check isn't here.
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer node
+			 * will fail when trying to fetch too many tuples.
 			 */
 			return slot;
 		else if (node->n_fullsort_remaining > 0)
@@ -524,10 +551,11 @@ ExecIncrementalSort(PlanState *pstate)
 			 * accumulated at least one additional prefix key group in the
 			 * full sort tuplesort. The first call to
 			 * switchToPresortedPrefixMode() will have pulled the first one of
-			 * those groups out, and we've returned those tuples to the inner
-			 * node, but if we tuples remaining in that tuplesort (i.e.,
-			 * n_fullsort_remaining > 0) at this point we need to do that
-			 * again.
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in the
+			 * full sort state (i.e., n_fullsort_remaining > 0), then we need to
+			 * re-execute the prefix mode transition function to pull out the
+			 * next prefix key group.
 			 */
 			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
 					   node->n_fullsort_remaining);
@@ -536,10 +564,10 @@ ExecIncrementalSort(PlanState *pstate)
 		else
 		{
 			/*
-			 * If we don't have any already sorted tuples to read, and we're
-			 * not in the middle of transitioning into presorted prefix sort
-			 * mode, then it's time to start the process all over again by
-			 * building new full sort group.
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
 			 */
 			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
 			node->execution_status = INCSORT_LOADFULLSORT;
@@ -547,24 +575,26 @@ ExecIncrementalSort(PlanState *pstate)
 	}
 
 	/*
-	 * Want to scan subplan in the forward direction while creating the sorted
-	 * data.
+	 * Scan the subplan in the forward direction while creating the sorted data.
 	 */
 	estate->es_direction = ForwardScanDirection;
 
 	outerNode = outerPlanState(node);
 	tupDesc = ExecGetResultType(outerNode);
 
+	/* Load tuples into the full sort state. */
 	if (node->execution_status == INCSORT_LOADFULLSORT)
 	{
 		/*
-		 * Initialize tuplesort module (only needed before the first group).
+		 * Initialize sorting structures.
 		 */
 		if (fullsort_state == NULL)
 		{
 			/*
 			 * Initialize presorted column support structures for
-			 * isCurrentGroup().
+			 * isCurrentGroup(). It's correct to do this along with the initial
+			 * initialization for the full sort state (and not for the prefix
+			 * sort state) since we always load the full sort state first.
 			 */
 			preparePresortedCols(node);
 
@@ -572,8 +602,7 @@ ExecIncrementalSort(PlanState *pstate)
 			 * Since we optimize small prefix key groups by accumulating a
 			 * minimum number of tuples before sorting, we can't assume that a
 			 * group of tuples all have the same prefix key values. Hence we
-			 * setup the full sort tuplesort to sort by all requested sort
-			 * columns.
+			 * setup the full sort tuplesort to sort by all requested sort keys.
 			 */
 			fullsort_state = tuplesort_begin_heap(tupDesc,
 												  plannode->sort.numCols,
@@ -588,13 +617,13 @@ ExecIncrementalSort(PlanState *pstate)
 		}
 		else
 		{
-			/* Reset sort for a new prefix key group. */
+			/* Reset sort for the next batch. */
 			tuplesort_reset(fullsort_state);
 		}
 
 		/*
-		 * Calculate the remaining tuples left if the bounded and configure
-		 * both bounded sort and the minimum group size accordingly.
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
 		 */
 		if (node->bounded)
 		{
@@ -616,9 +645,9 @@ ExecIncrementalSort(PlanState *pstate)
 
 		/*
 		 * Because we have to read the next tuple to find out that we've
-		 * encountered a new prefix key group on subsequent groups we have to
+		 * encountered a new prefix key group, on subsequent groups we have to
 		 * carry over that extra tuple and add it to the new group's sort
-		 * here.
+		 * here before we read any new tuples from the outer node.
 		 */
 		if (!TupIsNull(node->group_pivot))
 		{
@@ -636,6 +665,11 @@ ExecIncrementalSort(PlanState *pstate)
 				ExecClearTuple(node->group_pivot);
 		}
 
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our current
+		 * operating mode.
+		 */
 		for (;;)
 		{
 			/*
@@ -646,11 +680,16 @@ ExecIncrementalSort(PlanState *pstate)
 			slot = ExecProcNode(outerNode);
 
 			/*
-			 * When the outer node can't provide us any more tuples, then we
-			 * can sort the current group and return those tuples.
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
 			 */
 			if (TupIsNull(slot))
 			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
 				node->outerNodeDone = true;
 
 				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
@@ -670,25 +709,31 @@ ExecIncrementalSort(PlanState *pstate)
 			if (nTuples < minGroupSize)
 			{
 				/*
-				 * If we have yet hit our target minimum group size, then
-				 * don't both with checking for inclusion in the current
-				 * prefix group since a large number of very tiny sorts is
-				 * inefficient.
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the current
+				 * prefix group, since at this point we assume we'll full sort
+				 * this batch anyway to avoid a large number of very tiny (and
+				 * thus inefficient) sorts.
 				 */
 				tuplesort_puttupleslot(fullsort_state, slot);
 				nTuples++;
 
-				/* Keep the last tuple of our minimal group as a pivot. */
+				/*
+				 * If we've reached our minimum group size, then we need to store
+				 * the most recent tuple as a pivot.
+				 */
 				if (nTuples == minGroupSize)
 					ExecCopySlot(node->group_pivot, slot);
 			}
 			else
 			{
 				/*
-				 * Once we've accumulated a minimum number of tuples, we start
-				 * checking for a new prefix key group. Only after we find
-				 * changed prefix keys can we guarantee sort stability of the
-				 * tuples we've already accumulated.
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of that
+				 * prefix key group. Only after we find changed prefix keys can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
 				 */
 				if (isCurrentGroup(node, node->group_pivot, slot))
 				{
@@ -703,11 +748,10 @@ ExecIncrementalSort(PlanState *pstate)
 				{
 					/*
 					 * Since the tuple we fetched isn't part of the current
-					 * prefix key group we can't sort it as part of this sort
-					 * group. Instead we need to carry it over to the next
-					 * group. We use the group_pivot slot as a temp container
-					 * for that purpose even though we won't actually treat it
-					 * as a group pivot.
+					 * prefix key group, we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot to
+					 * carry it over to the next batch (even though we won't
+					 * actually treat it as a group pivot).
 					 */
 					ExecCopySlot(node->group_pivot, slot);
 
@@ -717,8 +761,8 @@ ExecIncrementalSort(PlanState *pstate)
 						 * If the current node has a bound, and we've already
 						 * sorted n tuples, then the functional bound
 						 * remaining is (original bound - n), so store the
-						 * current number of processed tuples for use in
-						 * configuring sorting bound.
+						 * current number of processed tuples for later use in
+						 * configuring the sort state's bound.
 						 */
 						SO2_printf("Changing bound_Done from %ld to %ld\n",
 								   node->bound_Done,
@@ -728,7 +772,8 @@ ExecIncrementalSort(PlanState *pstate)
 
 					/*
 					 * Once we find changed prefix keys we can complete the
-					 * sort and begin reading out the sorted tuples.
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
 					 */
 					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
 							   nTuples);
@@ -746,18 +791,24 @@ ExecIncrementalSort(PlanState *pstate)
 			}
 
 			/*
-			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples
-			 * then we make the assumption that it's likely that we've found a
-			 * large group of tuples having a single prefix key (as long as
-			 * the last tuple didn't shift us into reading from the full sort
-			 * mode tuplesort).
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key group),
+			 * so we need to transition into presorted prefix mode.
 			 */
 			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
 				node->execution_status != INCSORT_READFULLSORT)
 			{
 				/*
 				 * The group pivot we have stored has already been put into
-				 * the tuplesort; we don't want to carry it over.
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function manage
+				 * this state for us.
 				 */
 				ExecClearTuple(node->group_pivot);
 
@@ -783,7 +834,7 @@ ExecIncrementalSort(PlanState *pstate)
 				 * haven't yet completed fetching the current prefix key group
 				 * because the tuples we've "lost" already sorted "below" the
 				 * retained ones, and we're already contractually guaranteed
-				 * to not need any more than the currentBount tuples.
+				 * to not need any more than the currentBound tuples.
 				 */
 				if (tuplesort_used_bound(node->fullsort_state))
 				{
@@ -798,10 +849,9 @@ ExecIncrementalSort(PlanState *pstate)
 						   nTuples);
 
 				/*
-				 * Track the number of tuples we need to move from the
-				 * fullsort to presorted prefix sort (we might have multiple
-				 * prefix key groups, so we need a way to see if we've
-				 * actually finished).
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know how many
+				 * tuples it needs to move from the fullsort to presorted prefix
+				 * sort state.
 				 */
 				node->n_fullsort_remaining = nTuples;
 
@@ -828,50 +878,62 @@ ExecIncrementalSort(PlanState *pstate)
 	if (node->execution_status == INCSORT_LOADPREFIXSORT)
 	{
 		/*
-		 * Since we only enter this state after determining that all remaining
-		 * tuples in the full sort tuplesort have the same prefix, we've
-		 * already established a current group pivot tuple (but wasn't carried
-		 * over; it's already been put into the prefix sort tuplesort).
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the same
+		 * prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
 		 */
 		Assert(!TupIsNull(node->group_pivot));
 
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
 		for (;;)
 		{
 			slot = ExecProcNode(outerNode);
 
-			/* Check to see if there are no more tuples to fetch. */
+			/*
+			 * If we've exhausted tuples from the outer node we're done loading
+			 * the prefix sort state.
+			 */
 			if (TupIsNull(slot))
 			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
 				node->outerNodeDone = true;
 				break;
 			}
 
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not done
+			 * yet and can load it into the prefix sort state. If not, we don't
+			 * want to sort it as part of the current batch. Instead we use the
+			 * group_pivot slot to carry it over to the next batch (even though
+			 * we won't actually treat it as a group pivot).
+			 */
 			if (isCurrentGroup(node, node->group_pivot, slot))
 			{
-				/*
-				 * Fetch tuples and put them into the presorted prefix
-				 * tuplesort until we find changed prefix keys. Only then can
-				 * we guarantee sort stability of the tuples we've already
-				 * accumulated.
-				 */
 				tuplesort_puttupleslot(node->prefixsort_state, slot);
 				nTuples++;
 			}
 			else
 			{
-				/*
-				 * Since the tuple we fetched isn't part of the current prefix
-				 * key group we can't sort it as part of this sort group.
-				 * Instead we need to carry it over to the next group. We use
-				 * the group_pivot slot as a temp container for that purpose
-				 * even though we won't actually treat it as a group pivot.
-				 */
 				ExecCopySlot(node->group_pivot, slot);
 				break;
 			}
 		}
 
-		/* Perform the sort and return the tuples to the inner plan nodes. */
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
 		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
 
@@ -935,16 +997,14 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 
 	/*
 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
-	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
-	 * bucket in tuplesortstate.
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only one of many
+	 * sort batches in the current sort state.
 	 */
 	Assert((eflags & (EXEC_FLAG_REWIND |
 					  EXEC_FLAG_BACKWARD |
 					  EXEC_FLAG_MARK)) == 0);
 
-	/*
-	 * create state structure
-	 */
+	/* Initialize state structure. */
 	incrsortstate = makeNode(IncrementalSortState);
 	incrsortstate->ss.ps.plan = (Plan *) node;
 	incrsortstate->ss.ps.state = estate;
@@ -990,7 +1050,7 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	 */
 
 	/*
-	 * initialize child nodes
+	 * Initialize child nodes.
 	 *
 	 * We shield the child node from the need to support REWIND, BACKWARD, or
 	 * MARK/RESTORE.
@@ -1006,12 +1066,15 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 
 	/*
 	 * Initialize return slot and type. No need to initialize projection info
-	 * because this node doesn't do projections.
+	 * because we don't do any projections.
 	 */
 	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
 	incrsortstate->ss.ps.ps_ProjInfo = NULL;
 
-	/* make standalone slot to store previous tuple from outer node */
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
 	incrsortstate->group_pivot =
 		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
 								 &TTSOpsMinimalTuple);
@@ -1033,13 +1096,11 @@ ExecEndIncrementalSort(IncrementalSortState *node)
 {
 	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
 
-	/*
-	 * clean out the tuple table
-	 */
+	/* clean out the scan tuple */
 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
 	/* must drop pointer to sort result tuple */
 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
-	/* must drop stanalone tuple slot from outer node */
+	/* must drop standalone tuple slots from outer node */
 	ExecDropSingleTupleTableSlot(node->group_pivot);
 	ExecDropSingleTupleTableSlot(node->transfer_tuple);
 
@@ -1086,6 +1147,8 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	node->outerNodeDone = false;
 
 	/*
+	 * XXX: This is suspect.
+	 *
 	 * If subnode is to be rescanned then we forget previous sort results; we
 	 * have to re-read the subplan and re-sort.  Also must re-sort if the
 	 * bounded-sort parameters changed or we didn't select randomAccess.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8efbb660b9..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1802,7 +1802,7 @@ cost_full_sort(Cost *startup_cost, Cost *run_cost,
 /*
  * cost_incremental_sort
  * 	Determines and returns the cost of sorting a relation incrementally, when
- *  the input path is already sorted by some of the pathkeys.
+ *  the input path is presorted by a prefix of the pathkeys.
  *
  * 'presorted_keys' is the number of leading pathkeys by which the input path
  * is sorted.
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 74799cd8fd..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1831,9 +1831,10 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix list of
+ * keys is potentially useful for improving the performance of the requested
+ * ordering. Thus we return either 0, if no useful keys are found, or the
+ * number of leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
@@ -1849,11 +1850,6 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
 										&n_common_pathkeys);
 
-	/*
-	 * Return the number of path keys in common, or 0 if there are none. Any
-	 * leading common pathkeys could be useful for ordering because we can use
-	 * the incremental sort.
-	 */
 	return n_common_pathkeys;
 }
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c194263cf8..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,8 +4922,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new paths we need consider is an explicit full or
- * incremental sort on the cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 77c15ebd78..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -804,7 +804,11 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 }
 
 /*
- * XXX Missing comment.
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state. Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
  */
 static void
 tuplesort_begin_batch(Tuplesortstate *state)
@@ -1291,8 +1295,11 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+
 /*
- * XXX Missing comment.
+ * tuplesort_used_bound
+ *
+ * Allow callers to find out if the sort state was able to use a bound.
  */
 bool
 tuplesort_used_bound(Tuplesortstate *state)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e6a9b67675..71ac1417ab 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2023,6 +2023,10 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
 typedef struct IncrementalSortGroupInfo
 {
 	int64		groupCount;
-- 
2.17.1

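The comment changes above describe the executor's batching approach: accumulate at least a minimum number of tuples, compare later tuples against a pivot, and carry the first tuple of the next prefix key group over into the next batch. As a reading aid, here is a minimal standalone sketch of the underlying idea, grouping input already sorted by x and sorting each group by y alone. All names here are hypothetical, and the heuristics discussed above (minimum group size, mode transitions) are deliberately omitted; this is not the patch's executor code.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } pair;	/* hypothetical two-column tuple */

static int cmp_y(const void *a, const void *b)
{
	return ((const pair *) a)->y - ((const pair *) b)->y;
}

int main(void)
{
	/* input is presorted by x */
	pair input[] = {{1,5},{1,2},{2,9},{2,1},{2,5},{3,3},{3,7}};
	int n = sizeof(input) / sizeof(input[0]);
	int start = 0;

	for (int i = 1; i <= n; i++)
	{
		/* group boundary: prefix key changed, or input exhausted */
		if (i == n || input[i].x != input[start].x)
		{
			/* sort only this group, and only on the remaining key */
			qsort(&input[start], i - start, sizeof(pair), cmp_y);
			for (int j = start; j < i; j++)
				printf("(%d, %d)\n", input[j].x, input[j].y);
			start = i;	/* next group begins at the tuple that ended this one */
		}
	}
	return 0;
}

Compiled with any C compiler, this prints the tuples ordered by (x, y) while never sorting more than one x-group at a time.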
v40-0002-comment-rewording-typo.patch (text/x-patch; charset=US-ASCII)
From 88ceef0e5bbc5497861cae085660be06878403b3 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sun, 22 Mar 2020 18:41:46 -0400
Subject: [PATCH v40 2/7] comment rewording/typo

---
 src/backend/optimizer/util/pathnode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9c8f3b1f0b..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -735,9 +735,9 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  need to consider the row counts as a measure of quality: every path will
  *	  produce the same number of rows.  It may however matter how much the
  *	  path ordering matches the final ordering, needed by upper parts of the
- *	  plan, because that will affect how expensive the incremental sort is.
- *	  because of that we need to consider both the total and startup path,
- *	  in addition to pathkeys.
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -781,7 +781,7 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * path has lower startup cost, the other has lower total cost.
 		 *
 		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (compring just total cost) otherwise?
+		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-- 
2.17.1

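The reworded comment above is the substantive point of this small patch: once incremental sort is available, a partial path that loses on total cost can still be worth keeping because of its startup cost or the ordering it provides. Here is a hedged standalone sketch of that dominance rule, using a hypothetical DemoPath struct in place of the real Path and pathkeys-comparison machinery:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	double	startup_cost;
	double	total_cost;
	int		n_useful_pathkeys;	/* stand-in for the real pathkeys comparison */
} DemoPath;

/* a dominates b only if it is at least as good on all three measures */
static bool dominates(const DemoPath *a, const DemoPath *b)
{
	return a->n_useful_pathkeys >= b->n_useful_pathkeys &&
		   a->startup_cost <= b->startup_cost &&
		   a->total_cost <= b->total_cost;
}

int main(void)
{
	DemoPath fast_start = {1.0, 100.0, 1};	/* cheap to start, expensive overall */
	DemoPath fast_total = {50.0, 60.0, 1};	/* expensive to start, cheap overall */

	/* prints "0 0": neither path dominates, so both would be retained */
	printf("%d %d\n", dominates(&fast_start, &fast_total),
		   dominates(&fast_total, &fast_start));
	return 0;
}

Neither example path dominates the other, which is exactly the case where add_partial_path now keeps both instead of comparing total cost alone.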
v40-0005-Consider-incremental-sort-paths-in-additional-pl.patch (text/x-patch; charset=US-ASCII)
From 520c507e32725a53ab6a1f45219e4d57ca63f851 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v40 5/7] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 237 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 +++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 366 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..6838a238cd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,239 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might let us
+	 * avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers orderings that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3132,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.17.1

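To summarize the control flow generate_useful_gather_paths adds above: for each useful ordering and each partial path, the planner puts a Gather Merge directly on a fully sorted path, adds an explicit full sort on the cheapest partial path, and adds an incremental sort wherever a pathkey prefix is already sorted, while paths with no ordering at all are skipped. A hedged sketch of that decision tree, with hypothetical stub functions standing in for the create_*_path and add_path calls:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-ins for the real path-construction calls */
static void add_gather_merge(void)        { puts("Gather Merge"); }
static void add_full_sort_gm(void)        { puts("Gather Merge over Sort"); }
static void add_incremental_sort_gm(void) { puts("Gather Merge over Incremental Sort"); }

static void
consider_partial_path(bool has_pathkeys, bool is_sorted,
					  int presorted_keys, bool is_cheapest)
{
	if (!has_pathkeys)
		return;				/* no ordering at all: skipped entirely */

	if (is_sorted)
	{
		add_gather_merge();	/* already ordered: Gather Merge alone */
		return;
	}

	if (is_cheapest)
		add_full_sort_gm();	/* explicit sort, cheapest partial path only */

	if (presorted_keys > 0)
		add_incremental_sort_gm();	/* presorted prefix: incremental sort */
}

int main(void)
{
	/*
	 * The cheapest partial path, presorted on one leading pathkey, gets
	 * both the full sort and the incremental sort variants.
	 */
	consider_partial_path(true, false, 1, true);
	return 0;
}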
v40-0003-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From a5790b2d095c9229d6674e5dcc14425321f4cb84 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v40 3/7] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting
    solely on the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1201 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  302 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   77 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3863 insertions(+), 158 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..fe77f8eb4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..0256dd42f1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+
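+/*
+ * Show sort method and space usage details for one group info struct
+ * (either the full sort or the presorted prefix sort groups), in either
+ * text or structured EXPLAIN output format.
+ */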
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, sortMethodName);
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * XXX: The previous version of the patch checked:
+			 * fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+			 * and continued if the condition was true (with the comment
+			 * "ignore any unfilled slots"). I'm not convinced that makes
+			 * sense since the same sort instrument can have been used
+			 * multiple times, so the last use still being in progress doesn't
+			 * seem to be relevant. Instead I'm now checking
+			 * to see if the group count for each group info is 0. If both are
+			 * 0, then we exclude the worker since it didn't contribute
+			 * anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..32ce05a63c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1201 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them all together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
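+ *
+ *		As an illustrative sketch (hypothetical table and index names, and
+ *		a plan shape rather than exact EXPLAIN output), a query like
+ *
+ *			SELECT * FROM tab ORDER BY a, b LIMIT 10;
+ *
+ *		with an index on (a) alone could use incremental sort like so:
+ *
+ *			Limit
+ *			  ->  Incremental Sort
+ *					Sort Key: a, b
+ *					Presorted Key: a
+ *					->  Index Scan using tab_a_idx on tab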
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/*
+ * Prepare information for comparing tuples on the presorted keys.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change.  Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all of the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next set of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		if (lastTuple)
+		{
+			/*
+			 * We retain the current group pivot tuple since we haven't yet
+			 * found the end of the current prefix key group.
+			 */
+			break;
+		}
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort batch sorter, so we'll sort this batch, let our caller
+		 * read out all of those tuples, and then come back around to
+		 * find another batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
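+/*
+ * A rough sketch of the execution status transitions implemented by
+ * switchToPresortedPrefixMode() above and ExecIncrementalSort() below
+ * (derived from the code; see those functions for the details):
+ *
+ *	LOADFULLSORT -> READFULLSORT		group boundary found or input ends
+ *	LOADFULLSORT -> LOADPREFIXSORT		max full sort group size reached and
+ *										all transferred tuples share a prefix
+ *	LOADFULLSORT -> READPREFIXSORT		max full sort group size reached and
+ *										a prefix boundary was found
+ *	LOADPREFIXSORT -> READPREFIXSORT	group boundary found or input ends
+ *	READFULLSORT/READPREFIXSORT -> LOADFULLSORT		once the tuplesort is
+ *										drained and nothing remains to transfer
+ */
+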
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start
+ * looking for a new group as soon as we've met our bound, to avoid fetching
+ * more tuples than we absolutely have to.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group, we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
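+
+/*
+ * For example (illustrative numbers only): with the defaults above we
+ * accumulate at least 32 tuples before checking prefix keys at all, and if
+ * we reach 64 tuples without observing a prefix key change we assume we're
+ * inside one large prefix key group and switch to presorted prefix mode.
+ * With a bound of, say, 10, the minimum group size drops to 10 so that we
+ * never fetch more tuples than the bound requires.
+ */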
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will NULL the slot if no more tuples
+		 * remain. If the tuplesort is empty, but we don't have any more
+		 * tuples available for sort from the outer node, then outerNodeDone
+		 * will have been set so we'll return the empty slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+		{
+			/*
+			 * TODO: there isn't a good test case for the node->outerNodeDone
+			 * case directly, but lots of other stuff fails if it's not there.
+			 * If the outer node will fail when trying to fetch too many
+			 * tuples, then things break if this check isn't here.
+			 */
+			return slot;
+		}
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to our
+			 * caller, but if tuples remain in that tuplesort (i.e.,
+			 * n_fullsort_remaining > 0) at this point we need to do that
+			 * again.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any already sorted tuples to read, and we're
+			 * not in the middle of transitioning into presorted prefix sort
+			 * mode, then it's time to start the process all over again by
+			 * building a new full sort group.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Want to scan subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize tuplesort module (only needed before the first group).
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup().
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * columns.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for a new prefix key group. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the tuples remaining if the sort is bounded, and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have
+		 * to carry that extra tuple over and add it to the new group's sort
+		 * here.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * When the outer node can't provide us any more tuples, then we
+			 * can sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * don't bother checking for inclusion in the current
+				 * prefix group since a large number of very tiny sorts is
+				 * inefficient.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/* Keep the last tuple of our minimal group as a pivot. */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * Once we've accumulated a minimum number of tuples, we start
+				 * checking for a new prefix key group. Only after we find
+				 * changed prefix keys can we guarantee sort stability of the
+				 * tuples we've already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we can't sort it as part of this sort
+					 * group. Instead we need to carry it over to the next
+					 * group. We use the group_pivot slot as a temp container
+					 * for that purpose even though we won't actually treat it
+					 * as a group pivot.
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for use in
+						 * configuring sorting bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and begin reading out the sorted tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples,
+			 * we assume it's likely that we've found a large group of tuples
+			 * having a single prefix key (as long as
+			 * the last tuple didn't shift us into reading from the full sort
+			 * mode tuplesort).
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBount tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * Track the number of tuples we need to move from the
+				 * fullsort to presorted prefix sort (we might have multiple
+				 * prefix key groups, so we need a way to see if we've
+				 * actually finished).
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * Since we only enter this state after determining that all remaining
+		 * tuples in the full sort tuplesort have the same prefix, we've
+		 * already established a current group pivot tuple (though it wasn't
+		 * carried over; it's already been put into the prefix sort tuplesort).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/* Check to see if there are no more tuples to fetch. */
+			if (TupIsNull(slot))
+			{
+				node->outerNodeDone = true;
+				break;
+			}
+
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				/*
+				 * Fetch tuples and put them into the presorted prefix
+				 * tuplesort until we find changed prefix keys. Only then can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * Since the tuple we fetched isn't part of the current prefix
+				 * key group we can't sort it as part of this sort group.
+				 * Instead we need to carry it over to the next group. We use
+				 * the group_pivot slot as a temp container for that purpose
+				 * even though we won't actually treat it as a group pivot.
+				 */
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/* Perform the sort and return the tuples to our caller. */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+	 * current group in the tuplesort states.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/*
+	 * create state structure
+	 */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * initialize child nodes
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because this node doesn't do projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/* make standalone slot to store previous tuple from outer node */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * XXX: This is suspect.
+	 *
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	node->outerNodeDone = false;
+
+	/*
+	 * Incremental sort doesn't retain the sorted output (we hold only the
+	 * current group in the tuplesort states), so we can't simply rewind and
+	 * rescan it.  We always forget previous sort results, re-read the
+	 * subplan, and re-sort.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..8efbb660b9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across the groups, where we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate its estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of sorting its
+	 * first group plus the input cost of fetching that group (the input's
+	 * startup cost plus one group's share of the input run cost).
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
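+
+/*
+ * Worked example with illustrative numbers only: for input_tuples = 10000
+ * and an estimated input_groups = 100, we cost a single tuplesort of
+ * 1.5 * (10000 / 100) = 150 tuples.  The startup cost is that group's sort
+ * startup cost plus the input's startup cost plus 1/100th of the input's
+ * run cost; the other 99 groups' sort and input costs, plus the per-tuple
+ * comparison/copy overhead and the per-group tuplesort reset overhead, all
+ * land in run_cost.
+ */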
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..74799cd8fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	/*
+	 * Return the number of pathkeys in common, or 0 if there are none.  Any
+	 * leading common pathkeys are useful for ordering, because an
+	 * incremental sort can be used to produce the remaining keys.
+	 */
+	return n_common_pathkeys;
 }
 
 /*
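
Aside (illustration only, not part of the patch): the contract of
pathkeys_common_contained_in() in one standalone C program, with plain int
arrays standing in for PathKey lists.  With ORDER BY (a, b, c) and an input
sorted by (a, b), containment fails but n_common = 2, which is exactly what
lets pathkeys_useful_for_ordering() report two useful keys.

	/* pathkeys_prefix_sketch.c -- hypothetical, illustrative only */
	#include <stdbool.h>
	#include <stdio.h>

	static bool
	common_contained_in(const int *keys1, int n1,
						const int *keys2, int n2, int *n_common)
	{
		int			n = 0;

		while (n < n1 && n < n2 && keys1[n] == keys2[n])
			n++;
		*n_common = n;

		/* true iff keys1 is a (not necessarily proper) prefix of keys2 */
		return n == n1;
	}

	int
	main(void)
	{
		int			query_keys[] = {1, 2, 3};	/* ORDER BY a, b, c */
		int			path_keys[] = {1, 2};		/* input sorted by a, b */
		int			n_common;
		bool		contained;

		contained = common_contained_in(query_keys, 3, path_keys, 2,
										&n_common);
		printf("contained=%d n_common=%d\n", (int) contained, n_common);
		return 0;
	}
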
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan
+ *	  instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an IncrementalSort plan to sort according to the given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..c194263cf8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need to consider are an explicit full sort or
+ * an incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
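
Aside (illustration only, not part of the patch): the decision structure of
the new create_ordered_paths() loop, reduced to a standalone sketch.
DemoPath and its fields are made-up stand-ins for the real planner data
structures.

	/* ordered_paths_sketch.c -- hypothetical, illustrative only */
	#include <stdbool.h>
	#include <stdio.h>

	typedef struct DemoPath
	{
		const char *name;
		bool		is_sorted;		/* already satisfies sort_pathkeys? */
		bool		is_cheapest;	/* cheapest-total input path? */
		int			presorted_keys; /* length of common pathkey prefix */
	} DemoPath;

	static void
	consider(const DemoPath *p, bool enable_incrementalsort)
	{
		if (p->is_sorted)
		{
			printf("%s: use as-is (projection only, if needed)\n", p->name);
			return;
		}
		if (p->is_cheapest)
			printf("%s: add full Sort path\n", p->name);
		if (enable_incrementalsort && p->presorted_keys > 0)
			printf("%s: add IncrementalSort path (%d presorted keys)\n",
				   p->name, p->presorted_keys);
	}

	int
	main(void)
	{
		DemoPath	paths[] = {
			{"seqscan", false, true, 0},
			{"indexscan(a)", false, false, 1},
			{"indexscan(a,b)", true, false, 2},
		};
		int			i;

		for (i = 0; i < 3; i++)
			consider(&paths[i], true);
		return 0;
	}
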
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..77c15ebd78 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We try to select a size such that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is kept low.  However, we don't consider array sizes smaller
+ * than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Create a memory context that survives tuplesort_reset.  It holds data
+	 * that is worth keeping across multiple similar sort batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the main context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * Having set up all of the other non-parallel-related state, we now set
+	 * up the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,73 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ * tuplesort_begin_batch
+ *
+ *	Set up, or reset, all state needed for processing a new set of tuples
+ *	with this sort state.  Called both from tuplesort_begin_common (the
+ *	first time sorting with this sort state) and from tuplesort_reset (for
+ *	subsequent batches).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +882,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +958,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1053,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1131,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1174,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1292,21 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
- *
- *	Release resources and clean up.
+ * tuplesort_used_bound
+ *
+ *	Report whether the bound supplied via tuplesort_set_bound was actually
+ *	used to perform a bounded (top-N) sort.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing the resources held by a tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1364,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it doesn't fit in main memory.  This
+	 * is why we consider space used on disk to be more important for
+	 * tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less than
+	 * the amount occupied by the same tuples in memory, thanks to a more
+	 * compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Release all the data in the tuplesort, but keep
+ *	the meta-information.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and thus
+ *	saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed the per-batch memory, re-set up all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2758,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2808,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3305,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
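
Aside (illustration only, not part of the patch): the max-space tracking
rule of tuplesort_updatemax() in isolation.  MaxTracker is a made-up
stand-in; the rule is that disk usage always takes precedence over memory
usage, and otherwise the larger value wins.

	/* updatemax_sketch.c -- hypothetical, illustrative only */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	typedef struct MaxTracker
	{
		int64_t		maxSpace;
		bool		isMaxSpaceDisk;
	} MaxTracker;

	static void
	update_max(MaxTracker *t, int64_t spaceUsed, bool isSpaceDisk)
	{
		if ((isSpaceDisk && !t->isMaxSpaceDisk) ||
			(isSpaceDisk == t->isMaxSpaceDisk && spaceUsed > t->maxSpace))
		{
			t->maxSpace = spaceUsed;
			t->isMaxSpaceDisk = isSpaceDisk;
		}
	}

	int
	main(void)
	{
		MaxTracker	t = {0, false};

		update_max(&t, 4096, false);	/* in-memory batch: 4kB */
		update_max(&t, 1024, true);		/* on-disk batch: smaller, still wins */
		update_max(&t, 512, true);		/* smaller on-disk batch: ignored */

		printf("max=%lld disk=%d\n", (long long) t.maxSpace,
			   (int) t.isMaxSpaceDisk);
		return 0;
	}
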
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..e6a9b67675 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,68 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..4f6f2288a3
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO If an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 25,       +
+             "Maximum Sort Space Used": 25,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v40-0006-A-couple-more-places-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v40-0006-A-couple-more-places-for-incremental-sort.patchDownload
From 9af4cf104c95fce9e5cce457c4ae3fce385f615b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v40 6/7] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 216 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..4dd2757ca2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, not point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,48 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		/* Fill in the estimated group count for create_gather_merge_path. */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7418,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

v40-0007-fix.patchtext/x-patch; charset=US-ASCII; name=v40-0007-fix.patchDownload
From a2e24cf863f34aaf3cdf367840542ae2a6333b1c Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 23 Mar 2020 22:19:12 -0400
Subject: [PATCH v40 7/7] fix

---
 src/backend/optimizer/plan/planner.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4dd2757ca2..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6572,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
-- 
2.17.1

#231Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: James Coleman (#230)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-23, James Coleman wrote:

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly.

I don't think that's the whole of it. My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters. For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns. Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.
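
For example (a sketch only, with made-up object names; whether the sort
node itself sees the ReScan call or a Material node interposed above it
absorbs the rewind depends on the plan chosen), that cursor case could
look like:

begin;
declare c scroll cursor for
  select * from (select * from t order by a) s order by a, b;
fetch all from c;       -- first full read of the sorted output
move absolute 0 in c;   -- rewind completely, no parameter change
fetch all from c;       -- must produce exactly the same result set
commit;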

I agree with you that the code doesn't seem to implement that.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#232James Coleman
jtc331@gmail.com
In reply to: Alvaro Herrera (#231)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Mar-23, James Coleman wrote:

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly.

I don't think that's the whole of it. My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters. For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns. Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.

I agree with you that the code doesn't seem to implement that.

I grepped the codebase for rescan, and noted this relevant info in
src/backend/executor/README:

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

That jibes pretty well with what you're saying.

The interesting thing with incremental sort, as the comments in the
patch already note, is that even if the params haven't changed, we
can't regenerate the same values again *unless* we know that we're
still in the same batch, or have only processed a single full batch
(and the tuples are still in the full sort state), or we've
transitioned to prefix mode and have only transferred tuples from the
full sort state for a single prefix key group.

That's a pretty narrow range of applicability of not needing to
re-execute the entire node, at least based on my assumptions about
when rescanning will typically happen.

So, two followup questions:

1. Given the narrow applicability, might it make sense to just say
"we're only going to do a total reset and rescan and not try to
implement a smart 'don't rescan if we don't have to'"?

2. What would be a typical or good way to test this? Should I
basically repeat many of the existing implementation tests but with a
cursor and verify that rescanning produces the same results? That's
probably the path I'm going to take if there are no objections. Of
course we would need even more testing if we wanted to have the "smart
rescan" functionality.

Thoughts?

James

#233Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#232)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:

On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Mar-23, James Coleman wrote:

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly.

I don't think that's the whole of it. My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters. For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns. Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.

I agree with you that the code doesn't seem to implement that.

I grepped the codebase for rescan, and noted this relevant info in
src/backend/executor/README:

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

That jibes pretty well with what you're saying.

The interesting thing with incremental sort, as the comments in the
patch already note, is that even if the params haven't changed, we
can't regenerate the same values again *unless* we know that we're
still in the same batch, or have only processed a single full batch
(and the tuples are still in the full sort state), or we've
transitioned to prefix mode and have only transferred tuples from the
full sort state for a single prefix key group.

That's a pretty narrow range of applicability of not needing to
re-execute the entire node, at least based on my assumptions about
when rescanning will typically happen.

So, two followup questions:

1. Given the narrow applicability, might it make sense to just say
"we're only going to do a total reset and rescan and not try to
implement a smart 'don't rescan if we don't have to'"?

I think that's a sensible approach.

2. What would be a typical or good way to test this? Should I
basically repeat many of the existing implementation tests but with a
cursor and verify that rescanning produces the same results? That's
probably the path I'm going to take if there are no objections. Of
course we would need even more testing if we wanted to have the "smart
rescan" functionality.

I haven't checked, but how are we testing it for the other nodes?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#234James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#233)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 24, 2020 at 7:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:

On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Mar-23, James Coleman wrote:

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly.

I don't think that's the whole of it. My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters. For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns. Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.

I agree with you that the code doesn't seem to implement that.

I grepped the codebase for rescan, and noted this relevant info in
src/backend/executor/README:

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

That jibes pretty well with what you're saying.

The interesting thing with incremental sort, as the comments in the
patch already note, is that even if the params haven't changed, we
can't regenerate the same values again *unless* we know that we're
still in the same batch, or have only processed a single full batch
(and the tuples are still in the full sort state), or we've
transitioned to prefix mode and have only transferred tuples from the
full sort state for a single prefix key group.

That's a pretty narrow range of applicability of not needing to
re-execute the entire node, at least based on my assumptions about
when rescanning will typically happen.

So, two followup questions:

1. Given the narrow applicability, might it make sense to just say
"we're only going to do a total reset and rescan and not try to
implement a smart 'don't rescan if we don't have to'"?

I think that's a sensible approach.

2. What would be a typical or good way to test this? Should I
basically repeat many of the existing implementation tests but with a
cursor and verify that rescanning produces the same results? That's
probably the path I'm going to take if there are no objections. Of
course we would need even more testing if we wanted to have the "smart
rescan" functionality.

I haven't checked, but how are we testing it for the other nodes?

I haven't checked yet, but figured I'd ask in case someone had ideas
off the top of their head.

While working on finding a test case to show rescan isn't implemented
properly yet, I came across a bug. At the top of
ExecInitIncrementalSort, we assert that eflags does not contain
EXEC_FLAG_REWIND. But the following query (with merge and hash joins
disabled) breaks that assertion:

select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1,2);

The comments about this flag in src/include/executor/executor.h say:

* REWIND indicates that the plan node should try to efficiently support
* rescans without parameter changes. (Nodes must support ExecReScan calls
* in any case, but if this flag was not given, they are at liberty to do it
* through complete recalculation. Note that a parameter change forces a
* full recalculation in any case.)

Now we know that except in rare cases (as just discussed recently up
thread) we can't implement rescan efficiently.

So is this a planner bug (i.e., should we try not to generate
incremental sort plans that require efficient rewind)? Or can we just
remove that part of the assertion and know that we'll implement the
rescan, albeit inefficiently? We already explicitly declare that we
don't support backwards scanning, but I don't see a way to declare the
same for rewind.
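
For comparison, plain sort doesn't assert those flags away: ExecInitSort
just folds them into its random-access decision, roughly (paraphrasing
nodeSort.c):

sortstate->randomAccess = (eflags & (EXEC_FLAG_REWIND |
                                     EXEC_FLAG_BACKWARD |
                                     EXEC_FLAG_MARK)) != 0;

That works there because the whole sorted result stays in one tuplesort,
but we can't do the equivalent since we throw away each batch as we move
on to the next one.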

I'm going to try the latter approach now to see if it at least solves
the immediate problem...

James

#235James Coleman
jtc331@gmail.com
In reply to: James Coleman (#234)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 24, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

While working on finding a test case to show rescan isn't implemented
properly yet, I came across a bug. At the top of
ExecInitIncrementalSort, we assert that eflags does not contain
EXEC_FLAG_REWIND. But the following query (with merge and hash joins
disabled) breaks that assertion:

select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1,2);

The comments about this flag in src/include/executor/executor.h say:

* REWIND indicates that the plan node should try to efficiently support
* rescans without parameter changes. (Nodes must support ExecReScan calls
* in any case, but if this flag was not given, they are at liberty to do it
* through complete recalculation. Note that a parameter change forces a
* full recalculation in any case.)

Now we know that except in rare cases (as just discussed recently up
thread) we can't implement rescan efficiently.

So is this a planner bug (i.e., should we try not to generate
incremental sort plans that require efficient rewind)? Or can we just
remove that part of the assertion and know that we'll implement the
rescan, albeit inefficiently? We already explicitly declare that we
don't support backwards scanning, but I don't see a way to declare the
same for rewind.

Other nodes seem to get a materialization node placed above them to
support this case "better". Is that something we should be doing?

I'm going to try the latter approach now to see if it at least solves
the immediate problem...

So a couple of interesting results here. First, it does seem to fix
the assertion failure, and rescan is being used in this case (as I
expected).

The plans have a bit of a weird shape, at least in my mind. First, to
get the incremental sort on the right side of the join I had to:
set enable_mergejoin = off;
set enable_hashjoin = off;
and got this plan:

Gather  (cost=1000.47..108541.96 rows=2 width=16)
  Workers Planned: 2
  ->  Nested Loop  (cost=0.47..107541.76 rows=1 width=16)
        Join Filter: (t.a = t_1.a)
        ->  Parallel Seq Scan on t  (cost=0.00..9633.33 rows=1 width=8)
              Filter: (a = ANY ('{1,2}'::integer[]))
        ->  Incremental Sort  (cost=0.47..75408.43 rows=1000000 width=8)
              Sort Key: t_1.a, t_1.b
              Presorted Key: t_1.a
              ->  Index Scan using idx_t_a on t t_1  (cost=0.42..30408.42 rows=1000000 width=8)

To get rid of the parallelism but keep the same basic plan shape I
further had to:
set enable_seqscan = off;
set enable_material = off;
and got this plan:

Nested Loop  (cost=0.89..195829.74 rows=2 width=16)
  Join Filter: (t.a = t_1.a)
  ->  Index Scan using idx_t_a on t  (cost=0.42..12.88 rows=2 width=8)
        Index Cond: (a = ANY ('{1,2}'::integer[]))
  ->  Incremental Sort  (cost=0.47..75408.43 rows=1000000 width=8)
        Sort Key: t_1.a, t_1.b
        Presorted Key: t_1.a
        ->  Index Scan using idx_t_a on t t_1  (cost=0.42..30408.42 rows=1000000 width=8)

Two observations:
1. Ideally the planner would have realized that the join condition can
be safely pushed down into both index scans (see the hand-rewritten
query after this list). I was surprised this didn't happen,
but...maybe that's just not supported?

2. Ideally the nested loop node would have the smarts to know that the
right child is ordered, and therefore it can stop pulling tuples from
it as soon as it stops matching the join condition for each iteration
of the loop. I'm less surprised this isn't supported; it seems like a
fairly advanced optimization (OTOH it is the kind of interesting
optimization incremental sort opens up in more cases).
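
For (1), here's a sketch of the rewrite I'd have expected the planner
to find (equivalent because the join clause plus the outer filter imply
the inner filter):

select * from t
join (select * from t where a in (1, 2) order by a, b) s on s.a = t.a
where t.a in (1, 2);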

I don't *think* either of these is an issue with the patch, but I
wanted to mention them in case they pique someone's curiosity or in
case one actually is a bug [in our patch or otherwise].

James

#236James Coleman
jtc331@gmail.com
In reply to: James Coleman (#235)
5 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 24, 2020 at 8:58 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 24, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:

While working on finding a test case to show rescan isn't implemented
properly yet, I came across a bug. At the top of
ExecInitIncrementalSort, we assert that eflags does not contain
EXEC_FLAG_REWIND. But the following query (with merge and hash joins
disabled) breaks that assertion:

select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1,2);

The comments about this flag in src/include/executor/executor.h say:

* REWIND indicates that the plan node should try to efficiently support
* rescans without parameter changes. (Nodes must support ExecReScan calls
* in any case, but if this flag was not given, they are at liberty to do it
* through complete recalculation. Note that a parameter change forces a
* full recalculation in any case.)

Now we know that except in rare cases (as just discussed recently up
thread) we can't implement rescan efficiently.

So is this a planner bug (i.e., should we try not to generate
incremental sort plans that require efficient rewind)? Or can we just
remove that part of the assertion and know that we'll implement the
rescan, albeit inefficiently? We already explicitly declare that we
don't support backwards scanning, but I don't see a way to declare the
same for rewind.

Other nodes seem to get a materialization node placed above them to
support this case "better". Is that something we should be doing?

I'm going to try the latter approach now to see if it at least solves
the immediate problem...

So a couple of interesting results here. First, it does seem to fix
the assertion failure, and rescan is being used in this case (as I
expected).

I've fixed the rescan implementation and added a test. I think we
might actually be able to get away with one basic test (it just has to
exercise both full and prefix sort states). Note, this also allowed me
to get rid of the sort_Done incremental sort state attribute that I'd
wondered earlier if we'd actually need.

In the attached patch series I've collapsed my previous fix commits,
but included the changes here as a separate patch in the series so you
can see the changes more easily.

James

Attachments:

v41-0003-fix-rescan.patchtext/x-patch; charset=US-ASCII; name=v41-0003-fix-rescan.patchDownload
From 356529ff0860c9e7af91081d0dfe38c0fda04e76 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 24 Mar 2020 22:06:39 -0400
Subject: [PATCH v41 3/5] fix rescan

---
 src/backend/commands/explain.c                |  8 +--
 src/backend/executor/nodeIncrementalSort.c    | 55 +++++++++----------
 src/include/nodes/execnodes.h                 |  1 -
 .../regress/expected/incremental_sort.out     | 31 +++++++++++
 src/test/regress/sql/incremental_sort.sql     | 11 ++++
 5 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cf8cfd31f5..56f8e1fd21 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2817,12 +2817,12 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	IncrementalSortGroupInfo *fullsortGroupInfo;
 	IncrementalSortGroupInfo *prefixsortGroupInfo;
 
-	if (!(es->analyze && incrsortstate->sort_Done))
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
 		return;
 
-	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
-	if (fullsortGroupInfo->groupCount > 0)
-		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
 	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
 	if (prefixsortGroupInfo->groupCount > 0)
 		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 296a0c0675..53dccf3450 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -963,12 +963,6 @@ ExecIncrementalSort(PlanState *pstate)
 	/* Restore to user specified direction. */
 	estate->es_direction = dir;
 
-	/*
-	 * Remember that we've begun our scan and sort so we know how to handle
-	 * rescan.
-	 */
-	node->sort_Done = true;
-
 	/*
 	 * Get the first or next tuple from tuplesort. Returns NULL if no more
 	 * tuples.
@@ -1000,8 +994,7 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
 	 * batches in the current sort state.
 	 */
-	Assert((eflags & (EXEC_FLAG_REWIND |
-					  EXEC_FLAG_BACKWARD |
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
 					  EXEC_FLAG_MARK)) == 0);
 
 	/* Initialize state structure. */
@@ -1010,15 +1003,15 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 	incrsortstate->ss.ps.state = estate;
 	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
 
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
 	incrsortstate->bounded = false;
-	incrsortstate->sort_Done = false;
 	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
 	incrsortstate->fullsort_state = NULL;
 	incrsortstate->prefixsort_state = NULL;
 	incrsortstate->group_pivot = NULL;
 	incrsortstate->transfer_tuple = NULL;
 	incrsortstate->n_fullsort_remaining = 0;
-	incrsortstate->bound_Done = 0;
 	incrsortstate->presorted_keys = NULL;
 
 	if (incrsortstate->ss.ps.instrument != NULL)
@@ -1132,30 +1125,35 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 
 	/*
-	 * XXX: This is suspect.
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't store
+	 * all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
 	 *
-	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
-	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
-	 * re-scan it at all.
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but that
+	 * seems a narrow enough case that it's not worth handling specially at
+	 * this time.
 	 */
-	if (!node->sort_Done)
-		return;
 
 	/* must drop pointer to sort result tuple */
 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
 
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
 	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
 
-	/*
-	 * XXX: This is suspect.
-	 *
-	 * If subnode is to be rescanned then we forget previous sort results; we
-	 * have to re-read the subplan and re-sort.  Also must re-sort if the
-	 * bounded-sort parameters changed or we didn't select randomAccess.
-	 *
-	 * Otherwise we can just rewind and rescan the sorted output.
-	 */
-	node->sort_Done = false;
 	if (node->fullsort_state != NULL)
 	{
 		tuplesort_end(node->fullsort_state);
@@ -1166,11 +1164,10 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 		tuplesort_end(node->prefixsort_state);
 		node->prefixsort_state = NULL;
 	}
-	node->bound_Done = 0;
 
 	/*
-	 * if chgParam of subnode is not null then plan will be re-scanned by
-	 * first ExecProcNode.
+	 * If chgParam of subnode is not null, then the plan will be re-scanned by
+	 * the first ExecProcNode.
 	 */
 	if (outerPlan->chgParam == NULL)
 		ExecReScan(outerPlan);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 71ac1417ab..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2070,7 +2070,6 @@ typedef struct IncrementalSortState
 	ScanState	ss;				/* its first field is NodeTag */
 	bool		bounded;		/* is the result set bounded? */
 	int64		bound;			/* if bounded, how many tuples are needed */
-	bool		sort_Done;		/* sort completed yet? */
 	bool		outerNodeDone;	/* finished fetching tuples from outer node */
 	int64		bound_Done;		/* value of bound we did the sort with */
 	IncrementalSortExecutionStatus execution_status;
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 4f6f2288a3..689143456e 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -602,6 +602,37 @@ select * from (select * from t order by a) s order by a, b limit 70;
  9 | 70
 (70 rows)
 
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
 -- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
 explain (analyze, costs off, summary off, timing off)
 select * from (select * from t order by a) s order by a, b limit 70;
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index 9320a10b91..e567a9a14d 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -50,6 +50,17 @@ delete from t;
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
 select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
 -- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
 explain (analyze, costs off, summary off, timing off)
 select * from (select * from t order by a) s order by a, b limit 70;
-- 
2.17.1

v41-0004-Consider-incremental-sort-paths-in-additional-pl.patch (text/x-patch; charset=US-ASCII)
From 1af6e4bedf03b9c86d1d4703fa76292b7f1dca9e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v41 4/5] Consider incremental sort paths in additional places

---
 src/backend/optimizer/path/allpaths.c | 237 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 +++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 366 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..6838a238cd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,239 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might let us
+	 * avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * input paths (aiming to preserve the ordering), but also considers orderings that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding an explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful ordering). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3132,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.17.1

v41-0005-A-couple-more-places-for-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 33034395d2aa711a9713190599c746058adff593 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v41 5/5] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 215 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably a duplicate of the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6511,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all input paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		/* assume rows scale with workers, as in create_ordered_paths */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

v41-0001-Consider-low-startup-cost-when-adding-partial-pa.patch (text/x-patch; charset=US-ASCII)
From e4a0edb72e456e2aea6dcfa69d33a58302f2b22a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v41 1/5] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen: a low startup cost partial path is ignored in favor
of a lower total cost partial path, even though the limit applied on top
of that would normally favor the lower startup cost plan.
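
To illustrate (a hypothetical example; the table and index are
assumptions, not part of this patch): with an index on t (a), a query
such as

    SELECT * FROM t ORDER BY a, b LIMIT 10;

can be served by a low startup cost partial path (parallel index scan
plus incremental sort) or by a lower total cost partial path (parallel
seq scan plus full sort). Under the old rule the former would be
discarded on total cost alone, even though the LIMIT makes it the
cheaper plan overall.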
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may, however, matter how well the
+ *	  path ordering matches the final ordering needed by upper parts of the
+ *	  plan.  Because that affects how expensive an incremental sort is,
+ *	  we need to consider both total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v41-0002-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 50700176c7397af6715ac81e8b6b5475406d5644 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v41 2/5] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

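As a hypothetical illustration (the table, index, and exact plan shape
are assumptions, not output produced by this patch): with an index on
t (a) supplying the presorted prefix, a query ordered by (a, b) could
yield a plan like

    EXPLAIN (COSTS OFF) SELECT * FROM t ORDER BY a, b LIMIT 10;
    --  Limit
    --    ->  Incremental Sort
    --          Sort Key: a, b
    --          Presorted Key: a
    --          ->  Index Scan using t_a_idx on t

where each group of rows with equal "a" is sorted only on "b".
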
Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1264 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   63 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  307 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   81 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1320 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   88 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3935 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..fe77f8eb4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..cf8cfd31f5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", "XXX Groups", true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	if (!(es->analyze && incrsortstate->sort_Done))
+		return;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+	if (fullsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..296a0c0675
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1264 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		The incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them together, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
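+ *
+ *		For instance (hypothetical table and index, purely illustrative),
+ *		given an index on events(ts), a query such as
+ *
+ *			SELECT * FROM events ORDER BY ts, id LIMIT 10;
+ *
+ *		can feed an index scan (already ordered by ts) into an incremental
+ *		sort on (ts, id); only the first few ts groups ever need sorting
+ *		before the LIMIT is satisfied.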
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change.  Therefore we do our comparison
+	 * starting from the last pre-sorted column to optimize for early
+	 * detection of inequality and minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
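+
+/*
+ * For example (illustrative values), with presortedCols = 2 and pivot
+ * (1, 7, ...), the tuple (1, 7, 42) belongs to the current group while
+ * (1, 8, 42) does not; because the comparison starts from the last
+ * presorted column, the mismatch is detected with a single
+ * equality-function call.
+ */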
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we can't know whether we've found the last
+		 * tuple in that prefix key group until we check the next tuple from
+		 * the outer plan node, so we retain the current group pivot tuple for
+		 * future prefix key group comparisons.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
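+
+/*
+ * A sketch of the above (illustrative tuples, one presorted key): if the
+ * full sort state holds (1,x) (1,y) (2,z), the first call moves (1,x) and
+ * (1,y) into the prefix sort state, sorts them, and leaves (2,z) in the
+ * transfer_tuple slot with n_fullsort_remaining = 1.  A later call carries
+ * (2,z) across and, the batch being exhausted, sets INCSORT_LOADPREFIXSORT
+ * so that group continues filling from the outer node.
+ */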
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
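+
+/*
+ * Worked example of the two thresholds above (default values, illustrative
+ * input): given input sorted by "a" containing 100 tuples with a = 1
+ * followed by 3 tuples with a = 2, we accumulate 32 tuples blindly and make
+ * the 32nd our pivot; once more than 64 tuples have arrived without the
+ * prefix key changing, we assume one large group and switch to presorted
+ * prefix mode, sorting the remaining a = 1 tuples on the suffix keys only.
+ */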
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of the target sort columns, performs incremental sort. The implemented
+ *		algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check
+	 * whether any tuples remain in that batch that we can return before
+	 * moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more tuples
+		 * remain. If the tuplesort is empty, but we don't have any more
+		 * tuples available for sort from the outer node, then outerNodeDone
+		 * will have been set so we'll return that now-empty slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer node
+			 * will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in the
+			 * full sort state (i.e., n_fullsort_remaining > 0), then we need to
+			 * re-execute the prefix mode transition function to pull out the
+			 * next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the initial
+			 * initialization for the full sort state (and not for the prefix
+			 * sort state) since we always load the full sort state first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our current
+		 * operating mode.
+		 */
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the current
+				 * prefix group since at this point we'll assume that we'll do
+				 * a full sort of this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to store
+				 * the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of that
+				 * prefix key group. Only after we find changed prefix keys can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot to
+					 * carry it over to the next batch (even though we won't
+					 * actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key group),
+			 * so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function manage
+				 * this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know how many
+				 * tuples it needs to move from the full sort state to the
+				 * presorted prefix tuplesort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the same
+		 * prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done loading
+			 * the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not done
+			 * yet and can load it into the prefix sort state. If not, we don't
+			 * want to sort it as part of the current batch. Instead we use the
+			 * group_pivot slot to carry it over to the next batch (even though
+			 * we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Remember that we've begun our scan and sort so we know how to handle
+	 * rescan.
+	 */
+	node->sort_Done = true;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
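+
+/*
+ * Reading aid (summary only, not normative): execution_status moves from
+ * INCSORT_LOADFULLSORT to INCSORT_READFULLSORT when a group boundary or the
+ * end of input is found, or through switchToPresortedPrefixMode() (to
+ * INCSORT_READPREFIXSORT or INCSORT_LOADPREFIXSORT) once more than
+ * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples arrive without a boundary; when a
+ * sorted batch is drained we return to INCSORT_LOADFULLSORT and start over.
+ */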
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we hold only one of many sort batches in the
+	 * current sort state at any given time.
+	 */
+	Assert((eflags & (EXEC_FLAG_REWIND |
+					  EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->bounded = false;
+	incrsortstate->sort_Done = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * XXX: This is suspect.
+	 *
+	 * If we haven't sorted yet, just return. If outerplan's chgParam is not
+	 * NULL then it will be re-scanned by ExecProcNode, else no reason to
+	 * re-scan it at all.
+	 */
+	if (!node->sort_Done)
+		return;
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	node->outerNodeDone = false;
+
+	/*
+	 * XXX: This is suspect.
+	 *
+	 * If subnode is to be rescanned then we forget previous sort results; we
+	 * have to re-read the subplan and re-sort.  Also must re-sort if the
+	 * bounded-sort parameters changed or we didn't select randomAccess.
+	 *
+	 * Otherwise we can just rewind and rescan the sorted output.
+	 */
+	node->sort_Done = false;
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+	node->bound_Done = 0;
+
+	/*
+	 * if chgParam of subnode is not null then plan will be re-scanned by
+	 * first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and inflate the
+	 * estimated average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
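
To make the arithmetic above easier to follow, here is a minimal standalone
sketch of the same group-based model.  It is illustrative only: the stand-in
for cost_tuplesort() keeps just an in-memory comparison term, and the 0.0025
and 0.01 constants are assumed defaults for comparison_cost and
cpu_tuple_cost, not values taken from the patch.

    #include <math.h>
    #include <stdio.h>

    typedef double Cost;

    /* Stand-in for cost_tuplesort(): in-memory comparison cost only. */
    static void
    sort_group_cost(Cost *startup, Cost *run, double tuples, Cost cmp_cost)
    {
        if (tuples < 2.0)
            tuples = 2.0;            /* avoid log(0), as in the patch */
        *startup = cmp_cost * tuples * log(tuples) / log(2.0);
        *run = cmp_cost * tuples;    /* crude per-tuple output cost */
    }

    /* Mirrors the group arithmetic of cost_incremental_sort() above. */
    static void
    incremental_sort_cost(double input_tuples, double input_groups,
                          Cost input_startup, Cost input_total,
                          Cost cmp_cost, Cost cpu_tuple_cost,
                          Cost *startup, Cost *total)
    {
        Cost    input_run = input_total - input_startup;
        double  group_tuples = input_tuples / input_groups;
        Cost    group_input_run = input_run / input_groups;
        Cost    g_startup, g_run, run;

        /* pessimistic 1.5x group size, as in the patch */
        sort_group_cost(&g_startup, &g_run, 1.5 * group_tuples, cmp_cost);

        /* pay for the input and the first group before the first tuple out */
        *startup = g_startup + input_startup + group_input_run;

        /* finish the first group, then process the remaining groups/input */
        run = g_run
            + (g_run + g_startup) * (input_groups - 1)
            + group_input_run * (input_groups - 1);

        /* group detection: one extra copy and comparison per tuple */
        run += (cpu_tuple_cost + cmp_cost) * input_tuples;
        /* tuplesort reset overhead per group */
        run += 2.0 * cpu_tuple_cost * input_groups;

        *total = *startup + run;
    }

    int
    main(void)
    {
        Cost    startup, total;

        /* 100k tuples in 100 groups behind an input costing 0..1000 */
        incremental_sort_cost(100000.0, 100.0, 0.0, 1000.0,
                              0.0025, 0.01, &startup, &total);
        printf("startup=%.2f total=%.2f\n", startup, total);
        return 0;
    }
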
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
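
Since canonical pathkeys are interned, the loops above compare PathKey
pointers directly.  A self-contained sketch of the same prefix logic, with
integer arrays standing in for the pointer-compared PathKey lists (the
helper name is hypothetical):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Returns true when keys1 is fully contained in keys2 as a prefix,
     * and reports the longest common prefix length, as in the patch.
     */
    static bool
    common_contained_in(const int *keys1, int n1,
                        const int *keys2, int n2, int *n_common)
    {
        int     n = 0;

        while (n < n1 && n < n2 && keys1[n] == keys2[n])
            n++;
        *n_common = n;
        return n == n1;
    }

    int
    main(void)
    {
        int     query[] = {1, 2, 3};    /* ORDER BY a, b, c */
        int     path[] = {1, 2};        /* input sorted by a, b */
        int     n_common;
        bool    contained = common_contained_in(query, 3, path, 2, &n_common);

        /* prints contained=0 n_common=2: not fully sorted, but two
         * presorted keys make the path a candidate for incremental sort */
        printf("contained=%d n_common=%d\n", (int) contained, n_common);
        return 0;
    }
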
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1831,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix of the
+ * pathkey list is potentially useful for improving the performance of
+ * the requested ordering.  Thus we return 0 if no useful keys are found,
+ * or else the number of leading keys shared with the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
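
To summarize the new control flow: each input path now yields up to two
sorted paths instead of one.  A compact restatement of the decision as a
sketch, with the planner types reduced to booleans and ints (illustrative
only, not patch code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch of the per-path decision in create_ordered_paths() above. */
    static void
    consider_path(bool is_sorted, int presorted_keys,
                  bool is_cheapest_total, bool enable_incrementalsort)
    {
        if (is_sorted)
            puts("use the path as-is (plus projection if needed)");
        else
        {
            if (is_cheapest_total)
                puts("add explicit full sort (can use the LIMIT bound)");
            if (enable_incrementalsort && presorted_keys > 0)
                puts("add incremental sort on the presorted prefix");
        }
    }

    int
    main(void)
    {
        /* e.g. input sorted by (a) while ORDER BY a, b was requested */
        consider_path(false, 1, true, true);
        return 0;
    }
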
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size
+ * so that the array exceeds ALLOCSET_SEPARATE_THRESHOLD (see the comments
+ * in grow_memtuples()) while keeping the allocation overhead low.  However,
+ * we don't consider array sizes less than 1024.
+ *
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
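
For a sense of scale, a small worked check of the macro.  The threshold and
struct size below are assumed typical values (8192 bytes, and 24 bytes for
SortTuple on a 64-bit build), not numbers taken from this patch:

    #include <stdio.h>

    #define ALLOCSET_SEPARATE_THRESHOLD 8192    /* assumed */
    #define SORTTUPLE_SIZE 24                   /* assumed sizeof(SortTuple) */
    #define Max(a, b) ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
        int     n = Max(1024, ALLOCSET_SEPARATE_THRESHOLD / SORTTUPLE_SIZE + 1);

        /* prints 1024 slots = 24576 bytes, above the 8192-byte threshold */
        printf("%d slots = %d bytes\n", n, n * SORTTUPLE_SIZE);
        return 0;
    }
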
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sorts of groups, in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is a figure for on-disk
+								 * space, false when it's a figure for
+								 * in-memory space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After setting up all of the other non-parallel-related state, we set
+	 * up all of the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples
+ * with this sort state.  Called both from tuplesort_begin_common (the first
+ * time sorting with this sort state) and tuplesort_reset (subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1295,25 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1371,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit that data into
+	 * main memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount occupied by the same tuples in memory,
+	 * due to a more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
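
The preference encoded above (any on-disk figure outranks any in-memory
figure) is easy to check in isolation.  A standalone sketch of just the
update rule:

    #include <stdbool.h>
    #include <stdio.h>

    /* The update rule from tuplesort_updatemax(), isolated. */
    static void
    update_max(bool isDisk, long space, bool *maxIsDisk, long *maxSpace)
    {
        if ((isDisk && !*maxIsDisk) ||
            (isDisk == *maxIsDisk && space > *maxSpace))
        {
            *maxIsDisk = isDisk;
            *maxSpace = space;
        }
    }

    int
    main(void)
    {
        bool    maxIsDisk = false;
        long    maxSpace = 0;

        update_max(false, 50L * 1024 * 1024, &maxIsDisk, &maxSpace);
        update_max(true, 20L * 1024 * 1024, &maxIsDisk, &maxSpace);
        /* prints disk=1 space=20971520: disk wins over larger memory use */
        printf("disk=%d space=%ld\n", (int) maxIsDisk, maxSpace);
        return 0;
    }
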
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave
+ *	the meta-information in.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This allows avoiding the recreation of tuplesort
+ *	states (and thus saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-set up all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
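
The intended usage pattern is a single Tuplesortstate reused across many
per-group batches.  A sketch, assuming a backend context: the tuplesort_*
calls are the existing API, while fetch_next_group_tuple(), emit() and
no_more_groups() are hypothetical helpers standing in for executor logic,
and "slot" is an assumed caller-provided TupleTableSlot.

    Tuplesortstate *ts = tuplesort_begin_heap(...);    /* args elided */

    for (;;)
    {
        /* load one group of tuples sharing the presorted-key prefix */
        while (fetch_next_group_tuple(slot))
            tuplesort_puttupleslot(ts, slot);

        tuplesort_performsort(ts);
        while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
            emit(slot);

        if (no_more_groups())
            break;
        tuplesort_reset(ts);    /* keep metadata, drop per-batch memory */
    }
    tuplesort_end(ts);
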
 
 /*
@@ -2591,8 +2765,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2815,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3312,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..71ac1417ab 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
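
A sketch of how the executor can use PresortedKeyData to detect group
boundaries: compare each presorted attribute of the incoming tuple against
a pivot tuple through the stored comparison function.  This is an assumption
based on the struct (the executor code is not part of this hunk), same_group()
is a hypothetical name, and it presumes fcinfo holds the btree equality
function for each key.

    /* true when "tuple" has the same presorted-key values as "pivot" */
    static bool
    same_group(PresortedKeyData *keys, int nkeys,
               TupleTableSlot *pivot, TupleTableSlot *tuple)
    {
        int     i;

        for (i = 0; i < nkeys; i++)
        {
            PresortedKeyData *key = &keys[i];
            bool    isnullA, isnullB;
            Datum   datumA = slot_getattr(pivot, key->attno, &isnullA);
            Datum   datumB = slot_getattr(tuple, key->attno, &isnullB);

            if (isnullA != isnullB)
                return false;
            if (isnullA)
                continue;        /* both NULL: treat as equal */

            key->fcinfo->args[0].value = datumA;
            key->fcinfo->args[0].isnull = false;
            key->fcinfo->args[1].value = datumB;
            key->fcinfo->args[1].isnull = false;
            key->fcinfo->isnull = false;
            if (!DatumGetBool(FunctionCallInvoke(key->fcinfo)))
                return false;
        }
        return true;
    }
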
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,72 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
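
The four-valued execution status above suggests a small state machine.  The
transitions sketched below are an inference from the enum and field names
(the executor itself is not part of this hunk):

    /*
     * Presumed mode transitions (inference, not executor code):
     *
     * INCSORT_LOADFULLSORT    load a minimum batch into fullsort_state,
     *                         sorting on all keys, until a boundary in the
     *                         presorted keys (or a large group) is seen
     * INCSORT_READFULLSORT    drain fullsort_state; go back to loading
     *                         while outer tuples remain
     * INCSORT_LOADPREFIXSORT  a large group was detected: load the rest of
     *                         the group into prefixsort_state, sorting only
     *                         the suffix keys (n_fullsort_remaining tracks
     *                         tuples carried over via transfer_tuple)
     * INCSORT_READPREFIXSORT  drain prefixsort_state, then resume loading
     */
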
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..4f6f2288a3
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1320 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
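+-- (The LIMIT values below straddle 32 and 64 on the assumption that the
+-- executor batches a minimum number of tuples, presumably 32, in full-sort
+-- mode before it can switch modes; the exact constant lives in the
+-- executor, not in this test.)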
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 25,       +
+             "Maximum Sort Space Used": 25,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..9320a10b91
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,88 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO if an analyze happens here the plans might change; should we
+-- solve by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect.
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#237James Coleman
jtc331@gmail.com
In reply to: James Coleman (#236)
12 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

In a previous email I'd summarized remaining TODOs I'd found. Here's
an updated list with several of them resolved.

Resolved:

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to keep the file header comment at a
higher level, covering the mathematical properties of incremental sort
rather than the details of the hybrid mode.

I've decided to do this, and the attached patch series includes the change.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that?

It seems like what we have is sufficient, given that the nodes (and
sort) we rely on have their own interrupt checks. The one place where
someone might make an argument otherwise would be in the mode
transition function, where we copy tuples from the full sort state to
the presorted sort state. If this is a problem, let me know, and I'll
change it, but for now I'm proceeding under the assumption that it's
not.

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless, I'm pretty
sure this code just doesn't work correctly. Additionally, the
sort_Done variable is poorly named; it probably would make more sense
to call it something like "scanHasBegun". I'm waiting to change it,
though, until I've cleaned up this code more holistically.

Fixed, as described in previous email.
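
If you want to poke at the rescan path by hand, here's a rough sketch
(the table t is the one from the regression tests; whether the planner
actually picks an incremental sort on the inner side depends on
costing, so treat this as illustrative only):

set enable_material = off;
explain (costs off)
select g.x, s.a, s.b
from generate_series(1, 3) g(x),
     lateral (select * from (select * from t order by a) q
              order by a, b limit 2) s;
reset enable_material;

The inner subquery gets rescanned once per outer row, which is exactly
the path ExecReScanIncrementalSort has to handle.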

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- force the type of plan we expect.

I've decided this doesn't seem to be a real issue, so the comment is removed.
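
As a sanity check, the general concern is easy to see by hand; this
sketch is illustrative only (the before/after plans may or may not
differ on a given build, which is rather the point):

create temp table t2 (a int, b int);
insert into t2 select 1, i from generate_series(1, 100) i;
explain (costs off)
select * from (select * from t2 order by a) s order by a, b limit 10;
analyze t2;
explain (costs off)
select * from (select * from t2 order by a) s order by a, b limit 10;
drop table t2;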

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

Included in the attached patch series. I use plpgsql to munge out the
space kB numbers. I also discovered two bugs in the JSON output along
the way and fixed those (memory and disk need to be output separately;
disk was using the wrong "space type" enum). Finally, I also use
plpgsql to check a few invariants (for now just that max space is
greater than or equal to the average).
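
For reference, a minimal sketch of the plpgsql approach (not the exact
test code; the query and table t are from the regression tests):

do $$
declare
    ln text;
begin
    for ln in
        execute 'explain (analyze, costs off, summary off, timing off)
                 select * from (select * from t order by a) s
                 order by a, b limit 55'
    loop
        -- Munge the variable kB figures so the output is stable.
        raise notice '%', regexp_replace(ln, '\d+kB', 'NNkB', 'g');
    end loop;
end;
$$;

The invariant checks work the same way: parse the avg/max numbers out
of each group line and raise an error if max is ever less than avg.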

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

I think we just leave this in as a comment.

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (is that concerning? it could be confusing,
but it isn't a compilation problem) I didn't move the function.

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.

Still remaining:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?
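
For context, the startup-vs-total tension is easy to eyeball with
plain EXPLAIN (illustrative only; tenk1 is the standard regression
table):

explain select * from tenk1 order by four, ten limit 1;
explain select * from tenk1 order by four, ten;

With the LIMIT, the startup number in the top node's
cost=startup..total pair is what matters; without it, total cost
dominates. The same trade-off applies to partial paths once
incremental sort makes low-startup-cost paths interesting.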

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Tomas, any chance you could take a look at the above XXX/questions? I
believe all of them that remain relate to the planner patches.

Thanks,
James

Attachments:

v42-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v42-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From e4a0edb72e456e2aea6dcfa69d33a58302f2b22a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v42 01/12] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path and a limit is a applied on
top of that which would normal favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v42-0004-better-rescan-sort-state-reset.patchtext/x-patch; charset=US-ASCII; name=v42-0004-better-rescan-sort-state-reset.patchDownload
From 574e5864ef586a7f39860e9b265b58f2b900af99 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Wed, 25 Mar 2020 08:36:36 -0400
Subject: [PATCH v42 04/12] better rescan sort state reset

---
 src/backend/executor/nodeIncrementalSort.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 53dccf3450..4cb7679f10 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -1154,14 +1154,20 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 
 	node->execution_status = INCSORT_LOADFULLSORT;
 
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them. We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because the pivot comparator
+	 * setup is guarded similarly, doing so might actually cause a leak.
+	 */
 	if (node->fullsort_state != NULL)
 	{
-		tuplesort_end(node->fullsort_state);
+		tuplesort_reset(node->fullsort_state);
 		node->fullsort_state = NULL;
 	}
 	if (node->prefixsort_state != NULL)
 	{
-		tuplesort_end(node->prefixsort_state);
+		tuplesort_reset(node->prefixsort_state);
 		node->prefixsort_state = NULL;
 	}
 
-- 
2.17.1

v42-0005-move-algorith-description.patchtext/x-patch; charset=US-ASCII; name=v42-0005-move-algorith-description.patchDownload
From 2d1d26e5033e26f18afca9e08d594be10e1e2f0b Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 11:10:21 -0400
Subject: [PATCH v42 05/12] move algorith description

---
 src/backend/executor/nodeIncrementalSort.c | 38 ++++++++++++----------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 4cb7679f10..fde8822a82 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -24,7 +24,7 @@
  *		(3, 3)
  *		(3, 7)
  *
- *		Incremental sort algorithm would split the input into the following
+ *		An incremental sort algorithm would split the input into the following
  *		groups, which have equal X, and then sort them by Y individually:
  *
  *			(1, 5) (1, 2)
@@ -49,6 +49,24 @@
  *		it can start producing rows early, before sorting the whole dataset,
  *		which is a significant benefit especially for queries with LIMIT.
  *
+ *		The algorithm we've implemented here is modified from the theoretical
+ *		base described above by operating in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very	large
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		columns), albeit at some extra cost while switching between modes.
+ *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -467,23 +485,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
  *		ExecIncrementalSort
  *
  *		Assuming that outer subtree returns tuple presorted by some prefix
- *		of target sort columns, performs incremental sort. The implemented
- *		algorithm operates in two different modes:
- *		  - Fetching a minimum number of tuples without checking prefix key
- *		    group membership and sorting on all columns when safe.
- *		  - Fetching all tuples for a single prefix key group and sorting on
- *		    solely the unsorted columns.
- *		We always begin in the first mode, and employ a heuristic to switch
- *		into the second mode if we believe it's beneficial.
- *
- *		Sorting incrementally can potentially use less memory, avoid fetching
- *		and sorting all tuples in the the dataset, and begin returning tuples
- *		before the entire result set is available.
- *
- *		The hybrid mode approach allows us to optimize for both very small
- *		groups (where the overhead of a new tuplesort is high) and very	large
- *		groups (where we can lower cost by not having to sort on already sorted
- *		columns), albeit at some extra cost while switching between modes.
+ *		of target sort columns, performs incremental sort.
  *
  *		Conditions:
  *		  -- none.
-- 
2.17.1

v42-0003-update-docs.patchtext/x-patch; charset=US-ASCII; name=v42-0003-update-docs.patchDownload
From a49b0c462344b18d1cdf227714b642a8d5556ac1 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Wed, 25 Mar 2020 08:36:24 -0400
Subject: [PATCH v42 03/12] update docs

---
 doc/src/sgml/config.sgml                      | 12 ++++++++++--
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fe77f8eb4c..47ceea43d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4550,8 +4550,16 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </term>
       <listitem>
        <para>
-        Enables or disables the query planner's use of incremental sort
-        steps. The default is <literal>on</literal>.
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m &lt; n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead from splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..bc2c2dbb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
-- 
2.17.1
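
A side note on the docs patch above: with the series applied, the new
GUC toggles like any other enable_* planner flag. A quick illustrative
check (the query is the one from the new regression file):

set enable_incrementalsort = off;
explain (costs off)
select * from (select * from tenk1 order by four) t order by four, ten
limit 1;
reset enable_incrementalsort;

With the GUC off the planner should fall back to a plain Sort here.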

v42-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v42-0002-Implement-incremental-sort.patchDownload
From 50c1c212dd623bf8a8d9c8e48a35d4690e8f2067 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v42 02/12] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1261 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   63 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  307 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1351 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   99 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3973 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..fe77f8eb4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..56f8e1fd21 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then
+			 * exclude it from the output since it either didn't launch or
+			 * didn't contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..53dccf3450
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1261 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *		Incremental sort is an optimized variant of multikey sort for cases
+ *		when the input is already sorted by a prefix of the sort keys.  For
+ *		example when a sort by (key1, key2 ... keyN) is requested, and the
+ *		input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *		divide the input into groups where keys (key1, ... keyM) are equal,
+ *		and only sort on the remaining columns.
+ *
+ *		Consider the following example.  We have input tuples consisting of
+ *		two integers (X, Y) already presorted by X, while it's required to
+ *		sort them by both X and Y.  Let input tuples be following.
+ *
+ *		(1, 5)
+ *		(1, 2)
+ *		(2, 9)
+ *		(2, 1)
+ *		(2, 5)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort algorithm would split the input into the following
+ *		groups, which have equal X, and then sort them by Y individually:
+ *
+ *			(1, 5) (1, 2)
+ *			(2, 9) (2, 1) (2, 5)
+ *			(3, 3) (3, 7)
+ *
+ *		After sorting these groups and putting them altogether, we would get
+ *		the following result which is sorted by X and Y, as requested:
+ *
+ *		(1, 2)
+ *		(1, 5)
+ *		(2, 1)
+ *		(2, 5)
+ *		(2, 9)
+ *		(3, 3)
+ *		(3, 7)
+ *
+ *		Incremental sort may be more efficient than plain sort, particularly
+ *		on large datasets, as it reduces the amount of data to sort at once,
+ *		making it more likely it fits into work_mem (eliminating the need to
+ *		spill to disk).  But the main advantage of incremental sort is that
+ *		it can start producing rows early, before sorting the whole dataset,
+ *		which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
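+
+/*
+ * Worked example with hypothetical numbers: if three in-memory sort groups
+ * used 10kB, 20kB, and 30kB, the accumulation above leaves
+ * totalMemorySpaceUsed = 60 and maxMemorySpaceUsed = 30, which EXPLAIN
+ * ANALYZE later reports as "Memory: 20kB (avg), 30kB (max)".
+ */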
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
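+
+/*
+ * A concrete sketch of the setup above, for a hypothetical ORDER BY a, b
+ * over integer columns with presortedCols = 1: the "<" ordering operator on
+ * "a" maps to the int4 "=" operator, whose implementing function (int4eq)
+ * gets its FmgrInfo and a reusable two-argument fcinfo cached here, so that
+ * isCurrentGroup() pays no per-tuple lookup cost.
+ */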
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0, ... n), the tail keys are more
+	 * likely to change.  Therefore we do our comparison starting from the
+	 * last pre-sorted column to optimize for early detection of inequality
+	 * and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
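+
+/*
+ * Using the tuples from the file header example with one presorted key (X):
+ * given pivot (2, 9), isCurrentGroup() returns true for (2, 1) and (2, 5)
+ * but false for (3, 3), which is what ends the second group.
+ */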
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all tuples already fetched are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
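+
+/*
+ * For example (hypothetical plan): under a LIMIT that leaves a bound of 10,
+ * the code below uses minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, 10) = 10,
+ * so prefix key checking starts after 10 tuples rather than fetching 32
+ * tuples we may never return.
+ */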
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group, we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
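+
+/*
+ * Informal sketch (forward scans assumed) of the execution_status transitions
+ * driven by ExecIncrementalSort() and switchToPresortedPrefixMode():
+ *
+ *   LOADFULLSORT   -> READFULLSORT    group boundary found or input exhausted
+ *   LOADFULLSORT   -> LOADPREFIXSORT  large-group heuristic fired and the
+ *                                     whole batch shares one prefix
+ *   LOADFULLSORT   -> READPREFIXSORT  heuristic fired but the batch held more
+ *                                     than one prefix key group
+ *   LOADPREFIXSORT -> READPREFIXSORT  group boundary found or input exhausted
+ *
+ * The READ* states return to LOADFULLSORT once a batch is fully returned,
+ * first re-invoking the prefix transition while n_fullsort_remaining > 0.
+ */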
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *		The implemented algorithm operates in two different modes:
+ *		  - Fetching a minimum number of tuples without checking prefix key
+ *		    group membership and sorting on all columns when safe.
+ *		  - Fetching all tuples for a single prefix key group and sorting on
+ *		    solely the unsorted columns.
+ *		We always begin in the first mode, and employ a heuristic to switch
+ *		into the second mode if we believe it's beneficial.
+ *
+ *		Sorting incrementally can potentially use less memory, avoid fetching
+ *		and sorting all tuples in the dataset, and begin returning tuples
+ *		before the entire result set is available.
+ *
+ *		The hybrid mode approach allows us to optimize for both very small
+ *		groups (where the overhead of a new tuplesort is high) and very large
+ *		groups (where we can lower cost by not having to sort on already sorted
+ *		columns), albeit at some extra cost while switching between modes.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to see
+	 * if there are any remaining tuples in that batch that we can return before
+	 * moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because tuplesort_gettupleslot will empty the slot if
+		 * no more tuples remain. If the tuplesort is empty, but we don't have
+		 * any more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer node
+			 * will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in the
+			 * full sort state (i.e., n_fullsort_remaining > 0), then we need to
+			 * re-execute the prefix mode transition function to pull out the
+			 * next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the initial
+			 * initialization for the full sort state (and not for the prefix
+			 * sort state) since we always load the full sort state first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort
+		 * here before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our current
+		 * operating mode.
+		 */
+		for (;;)
+		{
+			/*
+			 * TODO: do we need to check for interrupts inside these loops or
+			 * will the outer node handle that?
+			 */
+
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the current
+				 * prefix group since at this point we'll assume that we'll full
+				 * sort this batch to avoid a large number of very tiny (and thus
+				 * inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to store
+				 * the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of that
+				 * prefix key group. Only after we find changed prefix keys can
+				 * we guarantee sort stability of the tuples we've already
+				 * accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group, we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot to
+					 * carry it over to the next batch (even though we won't
+					 * actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the
+			 * full sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key group),
+			 * so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function manage
+				 * this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know how
+				 * many tuples it needs to move from the full sort to the
+				 * presorted prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the same
+		 * prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done loading
+			 * the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not done
+			 * yet and can load it into the prefix sort state. If not, we don't
+			 * want to sort it as part of the current batch. Instead we use the
+			 * group_pivot slot to carry it over to the next batch (even though
+			 * we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
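+
+/*
+ * Bound bookkeeping example (hypothetical numbers): for LIMIT 100, once a
+ * batch of 40 tuples has been sorted we set bound_Done = 40, so any later
+ * tuplesort is bounded to node->bound - node->bound_Done = 60 tuples.
+ */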
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we keep only one of potentially many sort
+	 * batches in the current sort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from the outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't store
+	 * all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but that
+	 * seems a narrow enough case that it's not worth handling specially at
+	 * this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned by
+	 * the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and increase its average
+	 * group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
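+
+/*
+ * (Editorial sketch of the resulting cost shape, in terms of the local
+ * variables above, with G = input_groups and N = input_tuples:
+ *
+ *    startup_cost = input_startup_cost + group_input_run_cost
+ *                   + group_startup_cost
+ *    run_cost     = group_run_cost
+ *                   + (G - 1) * (group_startup_cost + group_run_cost
+ *                                + group_input_run_cost)
+ *                   + (cpu_tuple_cost + comparison_cost) * N
+ *                   + 2.0 * cpu_tuple_cost * G
+ *
+ * and total_cost = startup_cost + run_cost.  Only the first group has to be
+ * read and sorted before the first tuple can be returned, which is what
+ * makes this path attractive under LIMIT.)
+ */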
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
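+
+/*
+ * (For example: with keys1 = (a, b) and keys2 = (a, c), *n_common is set to
+ * 1 and false is returned; with keys1 = (a) and keys2 = (a, b), *n_common is
+ * set to 1 and true is returned, since keys1 is fully contained in keys2.)
+ */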
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1831,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix list of
+ * keys is potentially useful for improving the performance of the requested
+ * ordering. Thus we return 0 if no valuable keys are found, or the number
+ * of leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
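+
+/*
+ * (Note: callers are expected to pass presorted_keys > 0; with no presorted
+ * leading keys a plain sort path is built instead -- see the Assert in
+ * cost_incremental_sort.)
+ */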
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * allocation overhead might be lowered.  However, we don't consider array
+ * sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
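+
+/*
+ * (For illustration, with typical 64-bit values -- an
+ * ALLOCSET_SEPARATE_THRESHOLD of 8192 bytes and a SortTuple of roughly 24
+ * bytes -- the second argument comes out to about 342, so the macro normally
+ * evaluates to 1024.)
+ */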
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of the
+	 * state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state. Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1295,25 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1371,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we treat space used on disk as more important for
+	 * tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less than
+	 * the amount occupied by the same tuples in memory, due to the more
+	 * compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  All per-batch data is discarded, but the
+ *	meta-information is kept.  After tuplesort_reset, the tuplesort is
+ *	ready to start a new sort.  This avoids recreating tuple sort states
+ *	(and thus saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
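+
+/*
+ * (Editorial sketch of the intended usage pattern, not taken verbatim from
+ * this patch: a caller sorting many similar batches does
+ *
+ *     state = tuplesort_begin_heap(...);
+ *     for each batch:
+ *         tuplesort_puttupleslot(state, slot);    ... load the batch
+ *         tuplesort_performsort(state);
+ *         ... read the sorted tuples back ...
+ *         tuplesort_reset(state);                 ... ready for next batch
+ *     tuplesort_end(state);
+ *
+ * which is what the incremental sort executor node does once per sort group.)
+ */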
 
 /*
@@ -2591,8 +2765,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2815,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3312,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
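+
+/*
+ * (A brief gloss: the executor node is expected to use flinfo/fcinfo, one
+ * PresortedKeyData per presorted column, to compare incoming tuples against
+ * the current group pivot and thereby detect sort group boundaries.)
+ */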
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
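+
+/*
+ * (A brief gloss on these states: the node alternates between loading tuples
+ * into a tuplesort and reading sorted tuples back out.  The "fullsort" pair
+ * refers to the tuplesort that sorts on all keys, used while accumulating an
+ * initial batch; the "prefixsort" pair refers to the tuplesort that sorts
+ * only the non-presorted suffix keys within a single group of equal
+ * presorted keys.)
+ */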
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..689143456e
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1351 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an ANALYZE happens here the plans might change; should we
+-- solve this by inserting extra rows or by adding a GUC that would somehow
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+                                           QUERY PLAN                                            
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ [                                                                +
+   {                                                              +
+     "Plan": {                                                    +
+       "Node Type": "Limit",                                      +
+       "Parallel Aware": false,                                   +
+       "Actual Rows": 55,                                         +
+       "Actual Loops": 1,                                         +
+       "Plans": [                                                 +
+         {                                                        +
+           "Node Type": "Incremental Sort",                       +
+           "Parent Relationship": "Outer",                        +
+           "Parallel Aware": false,                               +
+           "Actual Rows": 55,                                     +
+           "Actual Loops": 1,                                     +
+           "Sort Key": ["t.a", "t.b"],                            +
+           "Presorted Key": ["t.a"],                              +
+           "Full-sort Groups": {                                  +
+             "Group Count": 2,                                    +
+             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
+             "Average Sort Space Used": 27,                       +
+             "Maximum Sort Space Used": 27,                       +
+             "Sort Space Type": "Memory"                          +
+           },                                                     +
+           "Plans": [                                             +
+             {                                                    +
+               "Node Type": "Sort",                               +
+               "Parent Relationship": "Outer",                    +
+               "Parallel Aware": false,                           +
+               "Actual Rows": 100,                                +
+               "Actual Loops": 1,                                 +
+               "Sort Key": ["t.a"],                               +
+               "Sort Method": "quicksort",                        +
+               "Sort Space Used": 30,                             +
+               "Sort Space Type": "Memory",                       +
+               "Plans": [                                         +
+                 {                                                +
+                   "Node Type": "Seq Scan",                       +
+                   "Parent Relationship": "Outer",                +
+                   "Parallel Aware": false,                       +
+                   "Relation Name": "t",                          +
+                   "Alias": "t",                                  +
+                   "Actual Rows": 100,                            +
+                   "Actual Loops": 1                              +
+                 }                                                +
+               ]                                                  +
+             }                                                    +
+           ]                                                      +
+         }                                                        +
+       ]                                                          +
+     },                                                           +
+     "Triggers": [                                                +
+     ]                                                            +
+   }                                                              +
+ ]
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: 30kB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+                    QUERY PLAN                     
+---------------------------------------------------
+ [                                                +
+   {                                              +
+     "Plan": {                                    +
+       "Node Type": "Limit",                      +
+       "Parallel Aware": false,                   +
+       "Actual Rows": 70,                         +
+       "Actual Loops": 1,                         +
+       "Plans": [                                 +
+         {                                        +
+           "Node Type": "Incremental Sort",       +
+           "Parent Relationship": "Outer",        +
+           "Parallel Aware": false,               +
+           "Actual Rows": 70,                     +
+           "Actual Loops": 1,                     +
+           "Sort Key": ["t.a", "t.b"],            +
+           "Presorted Key": ["t.a"],              +
+           "Full-sort Groups": {                  +
+             "Group Count": 1,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 28,       +
+             "Maximum Sort Space Used": 28,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Presorted Groups": {                  +
+             "Group Count": 5,                    +
+             "Sort Methods Used": ["quicksort"],  +
+             "Average Sort Space Used": 25,       +
+             "Maximum Sort Space Used": 25,       +
+             "Sort Space Type": "Memory"          +
+           },                                     +
+           "Plans": [                             +
+             {                                    +
+               "Node Type": "Sort",               +
+               "Parent Relationship": "Outer",    +
+               "Parallel Aware": false,           +
+               "Actual Rows": 100,                +
+               "Actual Loops": 1,                 +
+               "Sort Key": ["t.a"],               +
+               "Sort Method": "quicksort",        +
+               "Sort Space Used": 30,             +
+               "Sort Space Type": "Memory",       +
+               "Plans": [                         +
+                 {                                +
+                   "Node Type": "Seq Scan",       +
+                   "Parent Relationship": "Outer",+
+                   "Parallel Aware": false,       +
+                   "Relation Name": "t",          +
+                   "Alias": "t",                  +
+                   "Actual Rows": 100,            +
+                   "Actual Loops": 1              +
+                 }                                +
+               ]                                  +
+             }                                    +
+           ]                                      +
+         }                                        +
+       ]                                          +
+     },                                           +
+     "Triggers": [                                +
+     ]                                            +
+   }                                              +
+ ]
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..e567a9a14d
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,99 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would somehow
+-- force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 55;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
+explain (analyze, costs off, summary off, timing off)
+select * from (select * from t order by a) s order by a, b limit 70;
+explain (analyze, costs off, summary off, timing off, format json)
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1
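
A note on the LIMIT values used throughout these tests: as the comments say, they bracket the executor's mode transition points, i.e. the 32- and 64-tuple batch boundaries (hence limits of 31/32/33 and 65/66). A minimal standalone sketch, not part of the patch, that reproduces the single-large-group case by hand on a build with this patch applied:

  create table demo(a int, b int);
  insert into demo(a, b) select 1, i from generate_series(1, 100) n(i);
  -- Pulling 33 rows forces the node past the first 32-tuple batch; if the
  -- planner picks the incremental sort plan, EXPLAIN ANALYZE reports the
  -- per-group statistics.
  explain (analyze, costs off, summary off, timing off)
  select * from (select * from demo order by a) s order by a, b limit 33;
  drop table demo;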

v42-0006-remove-interrupts-TODO.patch (text/x-patch; charset=US-ASCII)
From 1a9a8879152d976cec98ee984787649994c4104a Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 11:10:33 -0400
Subject: [PATCH v42 06/12] remove interrupts TODO

---
 src/backend/executor/nodeIncrementalSort.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index fde8822a82..5ba9ae3b2d 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -674,11 +674,6 @@ ExecIncrementalSort(PlanState *pstate)
 		 */
 		for (;;)
 		{
-			/*
-			 * TODO: do we need to check for interrupts inside these loops or
-			 * will the outer node handle that?
-			 */
-
 			slot = ExecProcNode(outerNode);
 
 			/*
-- 
2.17.1

v42-0007-remove-test-todo.patch (text/x-patch; charset=US-ASCII)
From 72924c391237f35861d33ce6533cf0a43c2956f7 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 11:20:36 -0400
Subject: [PATCH v42 07/12] remove test todo

---
 src/test/regress/expected/incremental_sort.out | 3 ---
 src/test/regress/sql/incremental_sort.sql      | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 689143456e..65604b3429 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -43,9 +43,6 @@ select * from (select * from tenk1 order by four) t order by four, ten;
 (6 rows)
 
 reset work_mem;
--- TODO: if an analyze happens here the plans might change; should we
--- solve this by inserting extra rows, or by adding a GUC that would somehow
--- force the type of plan we expect?
 create table t(a integer, b integer);
 -- A single large group tested around each mode transition point.
 insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index e567a9a14d..040324be32 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -16,9 +16,6 @@ explain (costs off)
 select * from (select * from tenk1 order by four) t order by four, ten;
 reset work_mem;
 
--- TODO: if an analyze happens here the plans might change; should we
--- solve this by inserting extra rows, or by adding a GUC that would somehow
--- force the type of plan we expect?
 create table t(a integer, b integer);
 
 -- A single large group tested around each mode transition point.
-- 
2.17.1
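
One note before the next patch: v42-0008 rewrites the EXPLAIN ANALYZE tests to go through plpgsql helper functions, because the reported sort space (kB) varies across platforms and build options. For the text format the stabilization is just a regexp masking the kB figures on each output line; a minimal illustration of the idea, runnable on its own:

  select regexp_replace('Sort Method: quicksort  Memory: 30kB',
                        '\d+kB', 'NNkB', 'g');
  -- returns: Sort Method: quicksort  Memory: NNkB

The JSON-format helpers below do the analogous thing structurally, walking the plan tree and overwriting the "Average/Maximum Sort Space Used" keys.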

v42-0008-explain-output-munging-and-bug-fixes.patch (text/x-patch; charset=US-ASCII)
From 193ebcd603b321ec952f3ef43130282ab3c699bf Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 10:39:04 -0400
Subject: [PATCH v42 08/12] explain output munging and bug fixes

---
 src/backend/commands/explain.c                |  22 +-
 .../regress/expected/incremental_sort.out     | 320 ++++++++++--------
 src/test/regress/sql/incremental_sort.sql     | 118 ++++++-
 3 files changed, 311 insertions(+), 149 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 56f8e1fd21..39d51848b6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2784,26 +2784,38 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		{
 			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
 								   groupInfo->maxMemorySpaceUsed, es);
-			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+
+			ExplainCloseGroup("Sort Space", memoryName.data, true, es);
 		}
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
 			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
 								   groupInfo->maxDiskSpaceUsed, es);
-			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			ExplainPropertyText("Sort Space Type", spaceTypeName, es);
+
+			ExplainCloseGroup("Sort Space", diskName.data, true, es);
 		}
 
-		ExplainCloseGroup("Incremental Sort Groups", "XXX Groups", true, es);
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
 	}
 }
 
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 65604b3429..ebb8412237 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -44,6 +44,102 @@ select * from (select * from tenk1 order by four) t order by four, ten;
 
 reset work_mem;
 create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
 -- A single large group tested around each mode transition point.
 insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
@@ -433,82 +529,59 @@ select * from (select * from t order by a) s order by a, b limit 55;
  2 | 55
 (55 rows)
 
--- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
-explain (analyze, costs off, summary off, timing off)
-select * from (select * from t order by a) s order by a, b limit 55;
-                                           QUERY PLAN                                            
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                  
 -------------------------------------------------------------------------------------------------
  Limit (actual rows=55 loops=1)
    ->  Incremental Sort (actual rows=55 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: 27kB (avg), 27kB (max)
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
          ->  Sort (actual rows=100 loops=1)
                Sort Key: t.a
-               Sort Method: quicksort  Memory: 30kB
+               Sort Method: quicksort  Memory: NNkB
                ->  Seq Scan on t (actual rows=100 loops=1)
 (9 rows)
 
-explain (analyze, costs off, summary off, timing off, format json)
-select * from (select * from t order by a) s order by a, b limit 55;
-                            QUERY PLAN                             
--------------------------------------------------------------------
- [                                                                +
-   {                                                              +
-     "Plan": {                                                    +
-       "Node Type": "Limit",                                      +
-       "Parallel Aware": false,                                   +
-       "Actual Rows": 55,                                         +
-       "Actual Loops": 1,                                         +
-       "Plans": [                                                 +
-         {                                                        +
-           "Node Type": "Incremental Sort",                       +
-           "Parent Relationship": "Outer",                        +
-           "Parallel Aware": false,                               +
-           "Actual Rows": 55,                                     +
-           "Actual Loops": 1,                                     +
-           "Sort Key": ["t.a", "t.b"],                            +
-           "Presorted Key": ["t.a"],                              +
-           "Full-sort Groups": {                                  +
-             "Group Count": 2,                                    +
-             "Sort Methods Used": ["quicksort", "top-N heapsort"],+
-             "Average Sort Space Used": 27,                       +
-             "Maximum Sort Space Used": 27,                       +
-             "Sort Space Type": "Memory"                          +
-           },                                                     +
-           "Plans": [                                             +
-             {                                                    +
-               "Node Type": "Sort",                               +
-               "Parent Relationship": "Outer",                    +
-               "Parallel Aware": false,                           +
-               "Actual Rows": 100,                                +
-               "Actual Loops": 1,                                 +
-               "Sort Key": ["t.a"],                               +
-               "Sort Method": "quicksort",                        +
-               "Sort Space Used": 30,                             +
-               "Sort Space Type": "Memory",                       +
-               "Plans": [                                         +
-                 {                                                +
-                   "Node Type": "Seq Scan",                       +
-                   "Parent Relationship": "Outer",                +
-                   "Parallel Aware": false,                       +
-                   "Relation Name": "t",                          +
-                   "Alias": "t",                                  +
-                   "Actual Rows": 100,                            +
-                   "Actual Loops": 1                              +
-                 }                                                +
-               ]                                                  +
-             }                                                    +
-           ]                                                      +
-         }                                                        +
-       ]                                                          +
-     },                                                           +
-     "Triggers": [                                                +
-     ]                                                            +
-   }                                                              +
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "quicksort",                    +
+                 "top-N heapsort"                +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
  ]
 (1 row)
 
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
 delete from t;
 -- An initial small group followed by a large group.
 insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
@@ -630,90 +703,69 @@ select * from t left join (select * from (select * from t order by a) v order by
 (2 rows)
 
 rollback;
--- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
-explain (analyze, costs off, summary off, timing off)
-select * from (select * from t order by a) s order by a, b limit 70;
-                                   QUERY PLAN                                    
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                         explain_analyze_without_memory                          
 ---------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 (Methods: quicksort) Memory: 28kB (avg), 28kB (max)
-         Presorted Groups: 5 (Methods: quicksort) Memory: 25kB (avg), 25kB (max)
+         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
          ->  Sort (actual rows=100 loops=1)
                Sort Key: t.a
-               Sort Method: quicksort  Memory: 30kB
+               Sort Method: quicksort  Memory: NNkB
                ->  Seq Scan on t (actual rows=100 loops=1)
 (10 rows)
 
-explain (analyze, costs off, summary off, timing off, format json)
-select * from (select * from t order by a) s order by a, b limit 70;
-                    QUERY PLAN                     
----------------------------------------------------
- [                                                +
-   {                                              +
-     "Plan": {                                    +
-       "Node Type": "Limit",                      +
-       "Parallel Aware": false,                   +
-       "Actual Rows": 70,                         +
-       "Actual Loops": 1,                         +
-       "Plans": [                                 +
-         {                                        +
-           "Node Type": "Incremental Sort",       +
-           "Parent Relationship": "Outer",        +
-           "Parallel Aware": false,               +
-           "Actual Rows": 70,                     +
-           "Actual Loops": 1,                     +
-           "Sort Key": ["t.a", "t.b"],            +
-           "Presorted Key": ["t.a"],              +
-           "Full-sort Groups": {                  +
-             "Group Count": 1,                    +
-             "Sort Methods Used": ["quicksort"],  +
-             "Average Sort Space Used": 28,       +
-             "Maximum Sort Space Used": 28,       +
-             "Sort Space Type": "Memory"          +
-           },                                     +
-           "Presorted Groups": {                  +
-             "Group Count": 5,                    +
-             "Sort Methods Used": ["quicksort"],  +
-             "Average Sort Space Used": 25,       +
-             "Maximum Sort Space Used": 25,       +
-             "Sort Space Type": "Memory"          +
-           },                                     +
-           "Plans": [                             +
-             {                                    +
-               "Node Type": "Sort",               +
-               "Parent Relationship": "Outer",    +
-               "Parallel Aware": false,           +
-               "Actual Rows": 100,                +
-               "Actual Loops": 1,                 +
-               "Sort Key": ["t.a"],               +
-               "Sort Method": "quicksort",        +
-               "Sort Space Used": 30,             +
-               "Sort Space Type": "Memory",       +
-               "Plans": [                         +
-                 {                                +
-                   "Node Type": "Seq Scan",       +
-                   "Parent Relationship": "Outer",+
-                   "Parallel Aware": false,       +
-                   "Relation Name": "t",          +
-                   "Alias": "t",                  +
-                   "Actual Rows": 100,            +
-                   "Actual Loops": 1              +
-                 }                                +
-               ]                                  +
-             }                                    +
-           ]                                      +
-         }                                        +
-       ]                                          +
-     },                                           +
-     "Triggers": [                                +
-     ]                                            +
-   }                                              +
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
  ]
 (1 row)
 
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
 delete from t;
 -- Small groups of 10 tuples each tested around each mode transition point.
 insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index 040324be32..b990b3b3de 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -18,6 +18,106 @@ reset work_mem;
 
 create table t(a integer, b integer);
 
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
 -- A single large group tested around each mode transition point.
 insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
@@ -36,11 +136,10 @@ delete from t;
 insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
 select * from (select * from t order by a) s order by a, b limit 55;
--- Test EXPLAIN ANALYZE (text output) with only a fullsort group.
-explain (analyze, costs off, summary off, timing off)
-select * from (select * from t order by a) s order by a, b limit 55;
-explain (analyze, costs off, summary off, timing off, format json)
-select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
 delete from t;
 
 -- An initial small group followed by a large group.
@@ -58,11 +157,10 @@ set local enable_sort = off;
 explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
 select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
 rollback;
--- Test EXPLAIN ANALYZE (text output) with both fullsort and presorted groups.
-explain (analyze, costs off, summary off, timing off)
-select * from (select * from t order by a) s order by a, b limit 70;
-explain (analyze, costs off, summary off, timing off, format json)
-select * from (select * from t order by a) s order by a, b limit 70;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
 delete from t;
 
 -- Small groups of 10 tuples each tested around each mode transition point.
-- 
2.17.1

v42-0009-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v42-0009-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From e8d96bed4ac66c290fb6e405fc2b6396dc71c568 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v42 09/12] Consider incremental sort paths in additional
 places

---
 src/backend/optimizer/path/allpaths.c | 237 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  | 130 +++++++++++++-
 src/include/optimizer/paths.h         |   2 +
 3 files changed, 366 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..6838a238cd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,239 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+static Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
+
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might let us
+	 * avoid a local sort.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * the input paths (aiming to preserve their ordering), but also considers any
+ * ordering that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3132,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..f6994779de 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
-- 
2.17.1

v42-0010-centralize-find_em_expr_for_rel.patchtext/x-patch; charset=US-ASCII; name=v42-0010-centralize-find_em_expr_for_rel.patchDownload
From 1e70bbfeec4c8b47b515e97f3798697ddac32166 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 12:40:05 -0400
Subject: [PATCH v42 10/12] centralize find_em_expr_for_rel

---
 contrib/postgres_fdw/postgres_fdw.c     | 29 -------------------------
 src/backend/optimizer/path/allpaths.c   | 29 -------------------------
 src/backend/optimizer/path/equivclass.c | 28 ++++++++++++++++++++++++
 src/include/optimizer/paths.h           |  1 +
 4 files changed, 29 insertions(+), 58 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6838a238cd..85586ec97b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2727,35 +2727,6 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-static Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * get_useful_pathkeys_for_relation
  *		Determine which orderings of a relation might be useful.
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f6994779de..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -137,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v42-0011-update-confusing-copy-paste-comment.patchtext/x-patch; charset=US-ASCII; name=v42-0011-update-confusing-copy-paste-comment.patchDownload
From 98ebd6e78ac6a2e18d3c0776c9577f4ee967d771 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Fri, 27 Mar 2020 12:40:24 -0400
Subject: [PATCH v42 11/12] update confusing copy/paste comment

---
 src/backend/optimizer/path/allpaths.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 85586ec97b..32bf734820 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2751,8 +2751,8 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	ListCell   *lc;
 
 	/*
-	 * Considering query_pathkeys is always worth it, because it might let us
-	 * avoid a local sort.
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
 	 */
 	if (root->query_pathkeys)
 	{
-- 
2.17.1

v42-0012-A-couple-more-places-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v42-0012-A-couple-more-places-for-incremental-sort.patchDownload
From ceb4fbfaf33a416f9dcc2af7e4061b4e12b4b113 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v42 12/12] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 220 ++++++++++++++++++++++++++++++-
 2 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6511,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,48 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

#238Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#237)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:

In a previous email I'd summarized remaining TODOs I'd found. Here's
an updated list with several resolved.

Resolved:

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

I've decided to do this, and the attached patch series includes the change.

It's a bit tough to find the right balance between what to put into the header
comment and what should go into function comments, but this seems mostly
reasonable. I wouldn't use the double-tab indentation and the copyright
notices should stay at the top.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that?

It seems like what we have is sufficient, given that the nodes (and
sort) we rely on have their own calls. The one place where someone
might make an argument otherwise would be in the mode transition
function where we copy tuples from the full sort state to the
presorted sort state. If this is a problem, let me know, and I'll
change it, but I'm proceeding under the assumption for now that it's
not.

I think what we have now is sufficient.
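
For the record, if it ever turned out we did want an explicit check in the
mode transition code, it would just mean adding a CHECK_FOR_INTERRUPTS()
call to the copy loop, roughly like this (a simplified sketch only: the
real transition logic also has to compare prefix keys, and the state
variable names here are illustrative):

    /* fullsort_state / prefixsort_state names are illustrative */
    while (tuplesort_gettupleslot(fullsort_state, true, false,
                                  slot, NULL))
    {
        CHECK_FOR_INTERRUPTS();
        tuplesort_puttupleslot(prefixsort_state, slot);
    }

But since the tuplesort code and the child node already check for
interrupts regularly, that seems unnecessary for now.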

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

Fixed, as described in previous email.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- force the type of plan we expect.

I've decided this doesn't seem to be a real issue, so, comment removed.

OK

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

Included in the attached patch series. I use plpgsql to munge out the
space kB numbers. I also discovered two bugs in the JSON output along
the way and fixed those (memory and disk need to be output separately;
disk was using the wrong "space type" enum). Finally I also use
plpgsql to check a few invariants (for now just that max space is
greater than or equal to the average).

OK

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

I think we just leave this in as a comment.

Fine with me.

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.
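
To sketch the idea (this is not in the patch; it assumes a helper along
the lines of postgres_fdw's get_useful_ecs_for_relation(), collecting
equivalence classes from mergejoinable join clauses, which doesn't exist
outside the FDW today):

    /* hypothetical: collect ECs from mergejoinable clauses, as in the FDW */
    foreach(lc, get_useful_ecs_for_relation(root, rel))
    {
        EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc);
        PathKey    *pathkey;

        /* skip volatile ECs and ECs with no expression from this rel */
        if (cur_ec->ec_has_volatile ||
            !find_em_expr_for_rel(cur_ec, rel))
            continue;

        /* consider a single-key ascending ordering on this EC */
        pathkey = make_canonical_pathkey(root, cur_ec,
                                         linitial_oid(cur_ec->ec_opfamilies),
                                         BTLessStrategyNumber,
                                         false);
        useful_pathkeys_list = lappend(useful_pathkeys_list,
                                       list_make1(pathkey));
    }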

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
it isn't a compilation problem), I didn't move the function.

I think it's OK that the two functions diverged; it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.

+1

Still remaining:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?
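
If we did decide to act on it, I'd expect the gate in add_partial_path
to look something like this (sketch only, reusing the existing
compare_path_costs_fuzzily() and the old total-cost-only logic):

    if (enable_incrementalsort)
        costcmp = compare_path_costs_fuzzily(new_path, old_path,
                                             STD_FUZZ_FACTOR);
    else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
        costcmp = COSTS_BETTER2;    /* old path cheaper overall */
    else if (old_path->total_cost > new_path->total_cost * STD_FUZZ_FACTOR)
        costcmp = COSTS_BETTER1;    /* new path cheaper overall */
    else
        costcmp = COSTS_EQUAL;

Not sure the extra branch buys us much, though.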

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Tomas, any chance you could take a look at the above XXX/questions? I
believe all of them that remain relate to the planner patches.

Yes, I'll take a look over the weekend.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#239James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#238)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 27, 2020 at 9:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:

In a previous email I'd summarized remaining TODOs I'd found. Here's
an updated list with several resolved.

Resolved:

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

I've decided to do this, and the attached patch series includes the change.

It's a bit tough to find the right balance between what to put into the header
comment and what should go into function comments, but this seems mostly
reasonable. I wouldn't use the double-tab indentation and the copyright
notices should stay at the top.

Fixed. I also re-ran pgindent on the nodeIncrementalSort.c file.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that?

It seems like what we have is sufficient, given that the nodes (and
sort) we rely on have their own calls. The one place where someone
might make an argument otherwise would be in the mode transition
function where we copy tuples from the full sort state to the
presorted sort state. If this is a problem, let me know, and I'll
change it, but I'm proceeding under the assumption for now that it's
not.

I think what we have now is sufficient.

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

Fixed, as described in previous email.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- force the type of plan we expect.

I've decided this doesn't seem to be a real issue, so, comment removed.

OK

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

Included in the attached patch series. I use plpgsql to munge out the
space kB numbers. I also discovered two bugs in the JSON output along
the way and fixed those (memory and disk need to be output separately;
disk was using the wrong "space type" enum). Finally I also use
plpgsql to check a few invariants (for now just that max space is
greater than or equal to the average).

OK

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

I think we just leave this in as a comment.

Fine with me.

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.

Unless your comment below about looking at truncate_useless_pathkeys
implies you're aiming to get this in now, I wonder if
we should just expand the comment to reference pathkeys useful for
merging as a possible future extension.
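
Concretely, the expanded XXX comment might read something like:

    /*
     * XXX At the moment this can only ever return a list with a single
     * element, because it looks at query_pathkeys only. So we might return
     * the pathkeys directly, but it seems plausible we'll want to consider
     * other orderings in the future. For example, pathkeys useful for
     * merging (cf. truncate_useless_pathkeys) could be generated here as
     * well.
     */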

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
it isn't a compilation problem), I didn't move the function.

I think it's OK that the two functions diverged; it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.

+1

Still remaining:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Tomas, any chance you could take a look at the above XXX/questions? I
believe all of them that remain relate to the planner patches.

Yes, I'll take a look over the weekend.

Awesome, thanks.

I collapsed things down including the changes referenced in this
email, since they were all comment formatting changes.

James

Attachments:

v43-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v43-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From e4a0edb72e456e2aea6dcfa69d33a58302f2b22a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v43 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen: a low startup cost partial path is ignored in favor
of a lower total cost partial path, even though a limit applied on top
would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may, however, matter how well the
+ *	  path ordering matches the final ordering needed by upper parts of the
+ *	  plan.  Because that affects how expensive the incremental sort is,
+ *	  we need to consider both total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v43-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v43-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 45b95f4631d808ed74811d32fb04c0401515cd8a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v43 3/4] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 ----
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++++
 src/backend/optimizer/plan/planner.c    | 130 ++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 5 files changed, 366 insertions(+), 32 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..32bf734820 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * the input paths (aiming to preserve the ordering), but also considers
+ * orderings that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide them.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding an explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each set
+			 * of useful pathkeys). We know the path is not sorted, because we
+			 * wouldn't get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1
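
One note on the helper contract patch 0003 leans on throughout:
pathkeys_common_contained_in reports both whether the requested ordering is
fully satisfied and how many leading keys already match. A rough standalone
sketch of that contract, using plain int arrays instead of pathkey lists
(common_contained_in and all other names here are made up for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Report whether "want" is a prefix of "have", and the shared prefix length. */
static bool
common_contained_in(const int *want, int nwant,
					const int *have, int nhave, int *n_common)
{
	int			i = 0;

	while (i < nwant && i < nhave && want[i] == have[i])
		i++;
	*n_common = i;
	return (i == nwant);		/* fully satisfied only if every key matched */
}

int
main(void)
{
	int			want[] = {1, 2, 3};	/* query needs ORDER BY a, b, c */
	int			have[] = {1, 2};	/* path is sorted by a, b */
	int			presorted_keys;
	bool		is_sorted;

	is_sorted = common_contained_in(want, 3, have, 2, &presorted_keys);
	printf("is_sorted=%d presorted_keys=%d\n", is_sorted, presorted_keys);
	return 0;
}

is_sorted == false together with presorted_keys > 0 is exactly the situation
in which the patch considers building an incremental sort path.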

Attachment: v43-0004-A-couple-more-places-for-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 6ffe301df13f067893a81b060fcc6fab950e48b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v43 4/4] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 215 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably a duplicate of the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted, in which
+				 * case we want to consider adding an incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
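+				/* Without a presorted prefix, an incremental sort can't help. */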
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6511,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
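+				/* Without a presorted prefix, an incremental sort can't help. */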
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
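+		/* Without a presorted prefix, an incremental sort can't help. */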
+		if (presorted_keys == 0)
+			continue;
+
+		/* compute this before "path" is replaced with the sort path below */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1
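
Patch 0002 below documents the executor algorithm in detail. As a
self-contained illustration of the core idea only, the following toy program
(plain qsort over a fixed array, reusing the tuples from the comment in
nodeIncrementalSort.c; none of this is the patch's actual code) splits an
input presorted by X into equal-X groups and sorts each group by Y:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } Pair;

static int
cmp_y(const void *a, const void *b)
{
	return ((const Pair *) a)->y - ((const Pair *) b)->y;
}

int
main(void)
{
	Pair		in[] = {{1, 5}, {1, 2}, {2, 9}, {2, 1}, {2, 5}, {3, 3}, {3, 7}};
	int			n = sizeof(in) / sizeof(in[0]);
	int			start = 0;

	for (int i = 1; i <= n; i++)
	{
		/* a group ends when X changes (or at the end of the input) */
		if (i == n || in[i].x != in[start].x)
		{
			qsort(in + start, i - start, sizeof(Pair), cmp_y);
			start = i;
		}
	}

	/* prints the tuples sorted by (X, Y) */
	for (int i = 0; i < n; i++)
		printf("(%d, %d)\n", in[i].x, in[i].y);
	return 0;
}

The real executor additionally batches several small groups into a single
tuplesort and switches to the per-group "presorted prefix" mode only when a
group looks large enough to justify it.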

Attachment: v43-0002-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 0ae6e4c64d16c579cd32557fafd5869f0334ada9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v43 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   22 +
 src/backend/commands/explain.c                |  223 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1267 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   63 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  307 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1400 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 40 files changed, 4144 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..47ceea43d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,28 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m &lt; n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead from splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..39d51848b6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,180 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
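+	/* Nothing to show unless ANALYZE was used and some group was sorted. */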
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..9fe93d5979
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them all together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0, ... n), the tail keys are the
+	 * ones most likely to change.  Therefore we do our comparison starting
+	 * from the last pre-sorted column, to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
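+ *
+ * As a sketch (hypothetical data): if the full sort state holds five tuples
+ * whose prefix key values are (1, 1, 2, 2, 2), the first call transfers and
+ * sorts the two prefix-1 tuples, leaving n_fullsort_remaining = 3; a later
+ * call picks up the remaining prefix-2 group.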
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for future prefix key group comparisons.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
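+
+/*
+ * For illustration: with the defaults above we accumulate at least 32 tuples
+ * per full sort batch, and once more than 64 consecutive tuples turn out to
+ * share the same prefix keys we assume a large group and switch to sorting
+ * only the suffix keys.
+ */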
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
 *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
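+ *
+ *		Roughly, execution cycles through four states:
+ *		  -- INCSORT_LOADFULLSORT: accumulate tuples sorted by all keys
+ *			 until we detect a new prefix key group or conclude we're
+ *			 inside one large group;
+ *		  -- INCSORT_READFULLSORT: return tuples from the sorted full sort
+ *			 batch;
+ *		  -- INCSORT_LOADPREFIXSORT: accumulate tuples sorted only by the
+ *			 suffix keys while the prefix keys still match the group pivot;
+ *		  -- INCSORT_READPREFIXSORT: return tuples from the sorted prefix
+ *			 batch.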
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because tuplesort_gettupleslot will set the slot to
+		 * NULL if no more tuples remain. If the tuplesort is empty, but we
+		 * don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial setup of the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
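+
+		/*
+		 * For example, under LIMIT 10 (currentBound = 10) the minimum group
+		 * size becomes 10 rather than 32, so we start checking for a group
+		 * boundary as soon as the bound is met rather than fetching extra
+		 * tuples from the outer node.
+		 */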
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the
+			 * full sort state, we assume that having read more than
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know how
+				 * many tuples remain to move from the full sort to the
+				 * presorted prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only one of many
+	 * sort batches in the current sort state at any given time.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them. We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because the pivot comparator state
+	 * setup is guarded similarly, doing so might actually cause a leak.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
 *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
 *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
 *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
 *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted keys
+	 * are equal.  Incremental sort is sensitive to the distribution of tuples
+	 * among groups, and we're relying on quite rough assumptions here.  Thus,
+	 * we're pessimistic about incremental sort performance and inflate the
+	 * estimated average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
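+
+	/*
+	 * Illustrative example (values are hypothetical): with input_tuples =
+	 * 10000 and input_groups = 100, each group is costed as a tuplesort of
+	 * 1.5 * 100 = 150 tuples, and the startup cost covers only the first
+	 * group plus its share of the input cost, which is what lets incremental
+	 * sort win under a small LIMIT.
+	 */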
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
 *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
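+ *
+ *    For example (with hypothetical pathkeys): keys1 = (a, b, c) and
+ *    keys2 = (a, b) set *n_common = 2 and return false, while keys1 = (a, b)
+ *    and keys2 = (a, b, c) also set *n_common = 2 but return true, since
+ *    keys1 is fully contained in keys2.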
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1831,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a path ordered by
+ * just a prefix of the requested keys is potentially useful. Thus we return
+ * 0 if no useful keys are found, or the number of leading keys shared by the
+ * list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an incremental sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..bc2c2dbb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the
+ * array allocation exceeds ALLOCSET_SEPARATE_THRESHOLD (see the comments in
+ * grow_memtuples()) while keeping allocation overhead low.  However, we
+ * don't consider array sizes less than 1024 entries.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
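/*
 * Worked example (editorial sketch, not part of the patch): assuming
 * ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes, as defined in memutils.h, and
 * sizeof(SortTuple) is about 24 bytes on a typical 64-bit build, the
 * threshold term yields 8192 / 24 + 1 = 342 entries, just enough to push the
 * allocation past the threshold.  Max() then raises this to the 1024-entry
 * floor (~24kB), so the threshold term only dominates for element sizes of
 * 8 bytes or less.
 */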
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sorted groups, either in-memory or
+								 * on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is the amount of
+								 * on-disk space, false when it is the
+								 * amount of in-memory space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for caller tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state is initialized, we
+	 * set up all of the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent uses).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1295,25 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing the resources of a tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1371,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk only when it fails to fit the data into
+	 * main memory.  That is why we treat space used on disk as more
+	 * significant than space used in memory when tracking peak resource
+	 * usage: for instance, an in-memory batch that used 60MB does not
+	 * displace an earlier batch that spilled 5MB to disk.  Note that a tuple
+	 * set may occupy less space on disk than in memory, thanks to the more
+	 * compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Discard all tuple data but keep the meta-information,
+ *	so that after tuplesort_reset the tuplesort is ready to start a new sort.
+ *	This avoids recreating tuplesort states (and saves resources) when sorting
+ *	multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-set up all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
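/*
 * Editorial sketch (not part of the patch): the batch-reuse pattern that
 * tuplesort_reset enables.  A single Tuplesortstate sorts many consecutive
 * batches without being destroyed and recreated.  fetch_group_tuple() and
 * emit_tuple() are hypothetical caller-side helpers, and the
 * tuplesort_begin_heap() arguments are assumed to be prepared as usual.
 */
Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, nkeys, attNums,
										  sortOperators, collations,
										  nullsFirst, work_mem, NULL, false);

for (int batch = 0; batch < nbatches; batch++)
{
	while (fetch_group_tuple(slot))		/* hypothetical helper */
		tuplesort_puttupleslot(ts, slot);
	tuplesort_performsort(ts);
	while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
		emit_tuple(slot);				/* hypothetical helper */
	tuplesort_reset(ts);				/* keep metadata, free batch data */
}
tuplesort_end(ts);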
 
 /*
@@ -2591,8 +2765,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2815,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3312,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on a prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
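/*
 * Editorial sketch (an assumed usage, not code from the patch): group
 * boundaries can be detected by comparing each presorted key of an incoming
 * tuple against a saved pivot tuple, calling the equality function cached in
 * flinfo.  same_group() is a hypothetical helper.
 */
static bool
same_group(PresortedKeyData *keys, int nkeys,
		   TupleTableSlot *pivot, TupleTableSlot *tuple)
{
	for (int i = 0; i < nkeys; i++)
	{
		Datum		d1,
					d2;
		bool		null1,
					null2;

		d1 = slot_getattr(pivot, keys[i].attno, &null1);
		d2 = slot_getattr(tuple, keys[i].attno, &null2);

		if (null1 != null2)
			return false;		/* one NULL, one not: different groups */
		if (null1)
			continue;			/* both NULL: treat as equal for grouping */
		if (!DatumGetBool(FunctionCall2(&keys[i].flinfo, d1, d2)))
			return false;
	}
	return true;
}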
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..ebb8412237
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1400 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                  
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "quicksort",                    +
+                 "top-N heapsort"                +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                         explain_analyze_without_memory                          
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#240Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#239)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 27, 2020 at 09:36:55PM -0400, James Coleman wrote:

On Fri, Mar 27, 2020 at 9:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:

In a previous email I'd summarized remaining TODOs I'd found. Here's
an updated list with several resolved.

Resolved:

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

I've decided to do this, and the attached patch series includes the change.

It's a bit tough to find the right balance what to put into the header
comment and what should go to function comments, but this seems mostly
reasonable. I wouldn't use the double-tab indentation and the copyright
notices should stay at the top.

Fixed. I also re-ran pg_indent on the nodeIncrementalSort.c file.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
* TODO: do we need to check for interrupts inside these loops or
* will the outer node handle that?

It seems like what we have is sufficient, given that the nodes (and
sort) we rely on have their own calls. The one place where someone
might make an argument otherwise would be in the mode transition
function where we copy tuples from the full sort state to the
presorted sort state. If this is a problem, let me know, and I'll
change it, but I'm proceeding under the assumption for now that it's
not.

I think what we have now is sufficient.

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

Fixed, as described in previous email.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- forcing the time of plan we expect.

I've decided this doesn't seem to be a real issue, so, comment removed.

OK

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

Included in the attached patch series. I use plpgsql to munge out the
space kB numbers. I also discovered two bugs in the JSON output along
the way and fixed those (memory and disk need to be output separately;
disk was using the wrong "space type" enum). Finally I also use
plpgsql to check a few invariants (for now just that max space is
greater than or equal to the average).

OK

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

I think we just leave this in as a comment.

Fine with me.

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.

Unless your comment below about looking at truncate_useless_pathkeys
is implying you're considering aiming to get this in now, I wonder if
we should just expand the comment to reference pathkeys useful for
merging as a possible future extension.

Maybe. I've been thinking about how to generate those path keys, but
it's a bit tricky, because pathkeys_useful_for_merging(), the thing
that backs truncate_useless_pathkeys(), actually gets pathkeys and merely
verifies if those are useful for merging. But this code needs to do the
opposite - generate those pathkeys.

But let's say we know we have a join on columns (a,b,c). For each of
those PathKey values we know it's useful for merging, but should we
generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
suppose we can look at pathkeys for existing paths of the relation to
prune the possibilities a bit, but what then?
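
To make that concrete, a hedged sketch (table names are hypothetical,
not from the patch):

-- t1(a, b, c) and t2(a, b, c), joined on all three columns
select * from t1 join t2
  on t1.a = t2.a and t1.b = t2.b and t1.c = t2.c;

A merge join only needs both inputs sorted the same way, so (a,b,c),
(b,c,a) or any other permutation would serve, and the number of
candidate orderings grows factorially with the number of join columns.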

BTW I wonder if we actually need the ec_has_volatile check in
get_useful_pathkeys_for_relation. The comment says we can't but is it
really true? pathkeys_useful_for_ordering doesn't do any such checks, so
I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
Opinions?

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
isn't a compilation problem) I didn't move the function.

I think it's OK the two functions diverged, it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.

+1

... which would also get rid of find_em_expr_for_rel().

Still remaining:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Tomas, any chance you could take a look at the above XXX/questions? I
believe all of them that remain relate to the planner patches.

Yes, I'll take a look over the weekend.

Awesome, thanks.

I collapsed things down including the changes referenced in this
email, since they were all comment formatting changes.

Thanks.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#241James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#240)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.

Unless your comment below about looking at truncate_useless_pathkeys
is implying you're considering aiming to get this in now, I wonder if
we should just expand the comment to reference pathkeys useful for
merging as a possible future extension.

Maybe. I've been thinking about how to generate those path keys, but
it's a bit tricky, because pathkeys_useful_for_merging(), the thing
that backs truncate_useless_pathkeys(), actually gets pathkeys and merely
verifies if those are useful for merging. But this code needs to do the
opposite - generate those pathkeys.

But let's say we know we have a join on columns (a,b,c). For each of
those PathKey values we know it's useful for merging, but should we
generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
suppose we can look at pathkeys for existing paths of the relation to
prune the possibilities a bit, but what then?

I'm not convinced it's worth it this time around. Both because of the
late hour in the CF etc., but also because it seems likely to become
pretty complex quickly, and also far more likely to raise performance
questions in planning if there's not a lot of work done to keep it
limited.

BTW I wonder if we actually need the ec_has_volatile check in
get_useful_pathkeys_for_relation. The comment says we can't but is it
really true? pathkeys_useful_for_ordering doesn't do any such checks, so
I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
Opinions?

I hadn't really looked at that part in depth before, but after reading
it over, re-reading the definition of volatility, and running a few
basic queries, I agree.

For example: we already do allow volatile pathkeys in a simple query like:
-- t(a, b) with index on (a)
select * from t order by a, random();

It makes sense that you couldn't push down such a sort to a foreign
server, given there's no constraint that said function isn't operating
directly on the primary database (in the fdw case). But that obviously
wouldn't apply here.
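
A hypothetical fdw counterpart for contrast (the foreign table is
made up):

-- ft(a, b) is a postgres_fdw foreign table
select * from ft order by a, random();

postgres_fdw declines to push that sort to the remote server, since a
volatile expression might not behave the same when evaluated remotely;
a local incremental sort evaluates it in the same place a plain sort
would.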

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
isn't a compilation problem) I didn't move the function.

I think it's OK the two functions diverged, it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly? The
only thing you lose then is the check that each eclass has a member
in the rel. But that check probably wasn't quite right anyway: at
least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in
the rel, but rather that some prefix does, right? The idea there would
be that we could create a gather merge here that supplies a partial
ordering (that prefix of query_pathkeys) and then above it the planner
might place another incremental sort (say above a join), or perhaps
even a join above that would actually be sufficient (since many joins
are capable of providing an ordering anyway).

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

The attached also adds a few XXX comments. In particular, I wonder if
we should verify that some prefix of the root->query_pathkeys is
actually made up of eclass members for the current rel, because
otherwise I think we can skip the loop on the subpaths entirely.

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.

+1

... which would also get rid of find_em_expr_for_rel().

... which, I think, would retain the need for find_em_expr_for_rel().

Note: The attached applied to the previous series compiles and runs
make check...but I haven't really tested it; it's meant more for "is
this the direction we want to go".

James

Attachments:

use_truncate_useless_pathkeys.patch.txttext/plain; charset=US-ASCII; name=use_truncate_useless_pathkeys.patch.txtDownload
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 32bf734820..eea41fafe3 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2727,63 +2727,6 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
-/*
- * get_useful_pathkeys_for_relation
- *		Determine which orderings of a relation might be useful.
- *
- * Getting data in sorted order can be useful either because the requested
- * order matches the final output ordering for the overall query we're
- * planning, or because it enables an efficient merge join.  Here, we try
- * to figure out which pathkeys to consider.
- *
- * This allows us to do incremental sort on top of an index scan under a gather
- * merge node, i.e. parallelized.
- *
- * XXX At the moment this can only ever return a list with a single element,
- * because it looks at query_pathkeys only. So we might return the pathkeys
- * directly, but it seems plausible we'll want to consider other orderings
- * in the future.
- */
-static List *
-get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
-{
-	List	   *useful_pathkeys_list = NIL;
-	ListCell   *lc;
-
-	/*
-	 * Considering query_pathkeys is always worth it, because it might allow us
-	 * to avoid a total sort when we have a partially presorted path available.
-	 */
-	if (root->query_pathkeys)
-	{
-		bool		query_pathkeys_ok = true;
-
-		foreach(lc, root->query_pathkeys)
-		{
-			PathKey    *pathkey = (PathKey *) lfirst(lc);
-			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-			Expr	   *em_expr;
-
-			/*
-			 * We can't use incremental sort for pathkeys containing volatile
-			 * expressions. We could walk the exppression itself, but checking
-			 * ec_has_volatile here saves some cycles.
-			 */
-			if (pathkey_ec->ec_has_volatile ||
-				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
-			{
-				query_pathkeys_ok = false;
-				break;
-			}
-		}
-
-		if (query_pathkeys_ok)
-			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
-	}
-
-	return useful_pathkeys_list;
-}
-
 /*
  * generate_useful_gather_paths
  *		Generate parallel access paths for a relation by pushing a Gather or
@@ -2800,7 +2743,6 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	ListCell   *lc;
 	double		rows;
 	double	   *rowsp = NULL;
-	List	   *useful_pathkeys_list = NIL;
 	Path	   *cheapest_partial_path = NULL;
 
 	/* If there are no partial paths, there's nothing to do here. */
@@ -2818,9 +2760,6 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	if (!enable_incrementalsort)
 		return;
 
-	/* consider incremental sort for interesting orderings */
-	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
-
 	/* used for explicit (full) sort paths */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
 
@@ -2829,104 +2768,125 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	 *
 	 * XXX I wonder if we need to consider adding a projection here, as
 	 * create_ordered_paths does.
+	 *
+	 * XXX I wonder if it'd be worth verifying that at least some prefix
+	 * of root->query_pathkeys has eclass members for this rel before
+	 * looping through the subpaths. If, for example, the first pathkey
+	 * isn't in this rel, then we know this loop won't ever be useful.
 	 */
-	foreach(lc, useful_pathkeys_list)
+	foreach(lc, rel->partial_pathlist)
 	{
-		List	   *useful_pathkeys = lfirst(lc);
-		ListCell   *lc2;
 		bool		is_sorted;
 		int			presorted_keys;
+		Path	   *subpath = (Path *) lfirst(lc);
+		GatherMergePath *path;
+		List	   *useful_pathkeys = NIL;
 
-		foreach(lc2, rel->partial_pathlist)
-		{
-			Path	   *subpath = (Path *) lfirst(lc2);
-			GatherMergePath *path;
-
-			/* path has no ordering at all, can't use incremental sort */
-			if (subpath->pathkeys == NIL)
-				continue;
+		/* path has no ordering at all, can't use incremental sort */
+		if (subpath->pathkeys == NIL)
+			continue;
 
-			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
-													 subpath->pathkeys,
-													 &presorted_keys);
+		/* consider incremental sort for interesting orderings */
+		useful_pathkeys = truncate_useless_pathkeys(root, rel, subpath->pathkeys);
 
-			/*
-			 * When the partial path is already sorted, we can just add a gather
-			 * merge on top, and we're done - no point in adding explicit sort.
-			 *
-			 * XXX Can't we skip this (maybe only for the cheapest partial path)
-			 * when the path is already sorted? Then it's likely duplicate with
-			 * the path created by generate_gather_paths.
-			 */
-			if (is_sorted)
-			{
-				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
-												subpath->pathkeys, NULL, rowsp);
+		/*
+		 * Getting data in sorted order can be useful either because the requested
+		 * order matches the final output ordering for the overall query we're
+		 * planning, or because it enables an efficient merge join.  Here, we try
+		 * to figure out which pathkeys to consider.
+		 *
+		 * This allows us to do incremental sort on top of an index scan under a gather
+		 * merge node, i.e. parallelized.
+		 *
+		 * XXX At the moment this can only ever return a list with a single element,
+		 * because it looks at query_pathkeys only. So we might return the pathkeys
+		 * directly, but it seems plausible we'll want to consider other orderings
+		 * in the future.
+		 *
+		 * XXX Should this just be root->query_pathkeys == useful_pathkeys
+		 * (or length of the two are equal) given our using truncate_useless_pathkeys
+		 * above?
+		 */
+		is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+												 subpath->pathkeys,
+												 &presorted_keys);
 
-				add_path(rel, &path->path);
-				continue;
-			}
+		/*
+		 * When the partial path is already sorted, we can just add a gather
+		 * merge on top, and we're done - no point in adding explicit sort.
+		 *
+		 * XXX Can't we skip this (maybe only for the cheapest partial path)
+		 * when the path is already sorted? Then it's likely duplicate with
+		 * the path created by generate_gather_paths.
+		 */
+		if (is_sorted)
+		{
+			path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+											subpath->pathkeys, NULL, rowsp);
 
-			Assert(!is_sorted);
+			add_path(rel, &path->path);
+			continue;
+		}
 
-			/*
-			 * Consider regular sort for the cheapest partial path (for each
-			 * useful pathkeys). We know the path is not sorted, because we'd
-			 * not get here otherwise.
-			 *
-			 * XXX This is not redundant with the gather merge path created in
-			 * generate_gather_paths, because that merely preserves ordering of
-			 * the cheapest partial path, while here we add an explicit sort to
-			 * get match the useful ordering.
-			 */
-			if (cheapest_partial_path == subpath)
-			{
-				Path	   *tmp;
+		Assert(!is_sorted);
 
-				tmp = (Path *) create_sort_path(root,
-												rel,
-												subpath,
-												useful_pathkeys,
-												-1.0);
+		/*
+		 * Consider regular sort for the cheapest partial path (for each
+		 * useful pathkeys). We know the path is not sorted, because we'd
+		 * not get here otherwise.
+		 *
+		 * XXX This is not redundant with the gather merge path created in
+		 * generate_gather_paths, because that merely preserves ordering of
+		 * the cheapest partial path, while here we add an explicit sort to
+		 * get match the useful ordering.
+		 */
+		if (cheapest_partial_path == subpath)
+		{
+			Path	   *tmp;
 
-				rows = tmp->rows * tmp->parallel_workers;
+			tmp = (Path *) create_sort_path(root,
+											rel,
+											subpath,
+											useful_pathkeys,
+											-1.0);
 
-				path = create_gather_merge_path(root, rel,
-												tmp,
-												rel->reltarget,
-												tmp->pathkeys,
-												NULL,
-												rowsp);
+			rows = tmp->rows * tmp->parallel_workers;
 
-				add_path(rel, &path->path);
+			path = create_gather_merge_path(root, rel,
+											tmp,
+											rel->reltarget,
+											tmp->pathkeys,
+											NULL,
+											rowsp);
 
-				/* Fall through */
-			}
+			add_path(rel, &path->path);
 
-			/*
-			 * Consider incremental sort, but only when the subpath is already
-			 * partially sorted on a pathkey prefix.
-			 */
-			if (presorted_keys > 0)
-			{
-				Path	   *tmp;
+			/* Fall through */
+		}
 
-				tmp = (Path *) create_incremental_sort_path(root,
-															rel,
-															subpath,
-															useful_pathkeys,
-															presorted_keys,
-															-1);
-
-				path = create_gather_merge_path(root, rel,
-												tmp,
-												rel->reltarget,
-												tmp->pathkeys,
-												NULL,
-												rowsp);
-
-				add_path(rel, &path->path);
-			}
+		/*
+		 * Consider incremental sort, but only when the subpath is already
+		 * partially sorted on a pathkey prefix.
+		 */
+		if (presorted_keys > 0)
+		{
+			Path	   *tmp;
+
+			tmp = (Path *) create_incremental_sort_path(root,
+														rel,
+														subpath,
+														useful_pathkeys,
+														presorted_keys,
+														-1);
+
+			path = create_gather_merge_path(root, rel,
+											tmp,
+											rel->reltarget,
+											tmp->pathkeys,
+											NULL,
+											rowsp);
+
+			add_path(rel, &path->path);
 		}
 	}
 }
#242Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#241)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:

On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.

Unless your comment below about looking at truncate_useless_pathkeys
is implying you're considering aiming to get this in now, I wonder if
we should just expand the comment to reference pathkeys useful for
merging as a possible future extension.

Maybe. I've been thinking about how to generate those path keys, but
it's a bit tricky, because pathkeys_useful_for_merging(), the thing
that backs truncate_useless_pathkeys(), actually gets pathkeys and merely
verifies if those are useful for merging. But this code needs to do the
opposite - generate those pathkeys.

But let's say we know we have a join on columns (a,b,c). For each of
those PathKey values we know it's useful for merging, but should we
generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
suppose we can look at pathkeys for existing paths of the relation to
prune the possibilities a bit, but what then?

I'm not convinced it's worth it this time around. Both because of the
late hour in the CF etc., but also because it seems likely to become
pretty complex quickly, and also far more likely to raise performance
questions in planning if there's not a lot of work done to keep it
limited.

Agreed. There'll be time to add more optimizations in the future.

BTW I wonder if we actually need the ec_has_volatile check in
get_useful_pathkeys_for_relation. The comment says we can't but is it
really true? pathkeys_useful_for_ordering doesn't do any such checks, so
I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
Opinions?

I hadn't really looked at that part in depth before, but after reading
it over, re-reading the definition of volatility, and running a few
basic queries, I agree.

For example: we already do allow volatile pathkeys in a simple query like:
-- t(a, b) with index on (a)
select * from t order by a, random();

It makes sense that you couldn't push down such a sort to a foreign
server, given there's no constraint that said function isn't operating
directly on the primary database (in the fdw case). But that obviously
wouldn't apply here.

Thanks for confirming my reasoning.

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
isn't a compilation problem) I didn't move the function.

I think it's OK the two functions diverged, it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly?

Yes, basically.

The only thing you lose then is the check that each eclass has a
member in the rel. But that check probably wasn't quite right anyway:
at least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in the
rel, but rather that some prefix does, right?

Probably, although I always forget how exactly this EC business works.
My reasoning is more "If pathkeys_useful_for_ordering does not need
that, why should we need it here?"

The idea there would be that we could create a gather merge here that
supplies a partial ordering (that prefix of query_pathkeys) and then
above it the planner might place another incremental sort (say above a
join), or perhaps even a join above that would actually be sufficient
(since many joins are capable of providing an ordering anyway).

Not sure. Isn't the idea that we do the Incremental Sort below the
Gather Merge, because then it's actually done in parallel?
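
Roughly this shape, I mean (sketch only; the index name is made up):

Gather Merge
  ->  Incremental Sort
        Sort Key: a, b
        Presorted Key: a
        ->  Parallel Index Scan using t_a_idx on t

Each worker sorts its share of the rows incrementally, and the Gather
Merge just merges the already-sorted streams.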

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

The attached also adds a few XXX comments. In particular, I wonder if
we should verify that some prefix of the root->query_pathkeys is
actually made up of eclass members for the current rel, because
otherwise I think we can skip the loop on the subpaths entirely.

Hmmm, that's an interesting possible optimization. I wonder if that
actually saves anything, though, because the number of paths in the loop
is likely fairly low.

I did notice though that find_em_expr_for_rel() is wholesale
copied (and unchanged) from the fdw code, so I moved it to
equivclass.c so both places can share it.

+1

... which would also get rid of find_em_expr_for_rel().

... which, I think, would retain the need for find_em_expr_for_rel().

Note: The attached applied to the previous series compiles and runs
make check...but I haven't really tested it; it's meant more for "is
this the direction we want to go".

Thanks, I'll take a look.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#243James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#242)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 2:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (Is that concerning? It could be confusing, but
isn't a compilation problem) I didn't move the function.

I think it's OK the two functions diverged, it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly?

Yes, basically.

The only thing your lose them is the check that each eclass has a
member in the rel. But that check probably wasn't quite right anyway:
at least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in the
rel, but rather that some prefix does, right?

Probably, although I always forget how exactly this EC business works.
My reasoning is more "If pathkeys_useful_for_ordering does not need
that, why should we need it here?"

I think it effectively does, since it's called by
truncate_useless_pathkeys with the pathkeys list a path provides and
root->query_pathkeys, and tries to find a prefix.

So if there wasn't a prefix of matching eclasses, we'd return 0 as the
number of matching prefix pathkeys, and thus a NIL list to the caller
of truncate_useless_pathkeys.

The idea there would be that we could create a gather merge here that
supplies a partial ordering (that prefix of query_pathkeys) and then
above it the planner might place another incremental sort (say above a
join), or perhaps even a join above that would actually be sufficient
(since many joins are capable of providing an ordering anyway).

Not sure. Isn't the idea that we do the Incremental Sort below the
Gather Merge, because then it's actually done in parallel?

Yeah, I think as I was typing this my ideas got kinda mixed up a bit,
or at least came out confusing. The incremental sort *would* need a
path that is ordered by a prefix of query_pathkeys, and it would
provide the sort on the full query_pathkeys.

But the incremental sort at this point would still be on a path with a
given rel, and that path's rel would need to contain all of the
eclasses for the pathkeys we want as the final ordering to be able to
sort on them, right? For example, suppose:
select * from t join s order by t.a, s.b -- index on t.a

We'd have a presorted path on t.a that matches a prefix of the
query_pathkeys, but we couldn't have the incremental sort on a path
whose rel only contained t, right? It'd have to be a rel that was the
result of the join, otherwise we don't yet have access to s.b.
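
So the shape I have in mind is roughly this (sketch only; the index
name is made up):

Incremental Sort
  Sort Key: t.a, s.b
  Presorted Key: t.a
  ->  Nested Loop
        ->  Index Scan using t_a_idx on t
        ->  <some scan of s>

where the nested loop preserves the ordering on t.a and the
incremental sort above it completes the ordering on s.b.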

Given that, I think we have two options in this code. Suppose query
pathkeys (a, b, c):
1. Build an incremental sort path on (a, b, c) if we have a path
ordered by (a) or (a, b) but only when (c) is already available.
2. Additionally (to generate the most possible paths): build an
incremental sort path on (a, b) if we have a path ordered by (a) but
(c) isn't yet available, roughly the shape sketched below.

Either way I think we need the ability to know if all (or some subset)
of the query pathkeys are for eclasses we have access to in the
current rel.
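
For option 2 the hoped-for plan would be roughly (hypothetical sketch;
c only becomes available after a join higher up):

Incremental Sort
  Sort Key: a, b, c
  Presorted Key: a, b
  ->  <join providing c, preserving (a, b)>
        ->  Gather Merge
              ->  Incremental Sort
                    Sort Key: a, b
                    Presorted Key: a
                    ->  Parallel Index Scan using ...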

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

The attached also adds a few XXX comments. In particular, I wonder if
we should verify that some prefix of the root->query_pathkeys is
actually made up of eclass members for the current rel, because
otherwise I think we can skip the loop on the subpaths entirely.

Hmmm, that's an interesting possible optimization. I wonder if that
actually saves anything, though, because the number of paths in the loop
is likely fairly low.

If what I said above is correct (and please poke holes in it if
possible), then I think we have to know the matching eclass count
anyway, so we might as well include the optimization since it'd be a
simple int comparison.

I did notice though that find_em_expr_for_rel() is wholesale
copied (and unchanged) from the fdw code, so I moved it to
equivclass.c so both places can share it.

+1

... which would also get rid of find_em_expr_for_rel().

... which, I think, would retain the need for find_em_expr_for_rel().

Note: The attached applied to the previous series compiles and runs
make check...but I haven't really tested it; it's meant more for "is
this the direction we want to go".

Thanks, I'll take a look.

James

#244Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#241)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:

On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly? The
only thing you lose then is the check that each eclass has a member
in the rel. But that check probably wasn't quite right anyway: at
least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in
the rel, but rather that some prefix does, right? The idea there would
be that we could create a gather merge here that supplies a partial
ordering (that prefix of query_pathkeys) and then above it the planner
might place another incremental sort (say above a join), or perhaps
even a join above that would actually be sufficient (since many joins
are capable of providing an ordering anyway).

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

Hmmm, I'm not sure the patch is quite correct.

Firstly, I suggest we don't remove get_useful_pathkeys_for_relation
entirely, because that allows us to add more useful pathkeys in the
future (even if we don't consider pathkeys useful for merging now).
We could also do the EC optimization in the function, return NIL and
not loop over partial_pathlist at all. That's a cosmetic issue, though.
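
(I.e., roughly this shape at the call site; sketch only:)

useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);

/* no computable prefix of query_pathkeys -> nothing to build */
if (useful_pathkeys_list == NIL)
	return;

foreach(lc, rel->partial_pathlist)
{
	/* ... consider gather merge / incremental sort paths here ... */
}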

More importantly, I'm not sure this makes sense:

/* consider incremental sort for interesting orderings */
useful_pathkeys = truncate_useless_pathkeys(root, rel, subpath->pathkeys);

...

is_sorted = pathkeys_common_contained_in(useful_pathkeys,
subpath->pathkeys,
&presorted_keys);

I mean, useful_pathkeys is bound to be a sublist of subpath->pathkeys,
right? So how could this ever return is_sorted=false?

The whole point is to end up with query_pathkeys (or whatever pathkeys
we deem useful), but this does not do that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#245James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#244)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 5:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:

On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly? The
only thing you lose then is the check that each eclass has a member
in the rel. But that check probably wasn't quite right anyway: at
least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in
the rel, but rather that some prefix does, right? The idea there would
be that we could create a gather merge here that supplies a partial
ordering (that prefix of query_pathkeys) and then above it the planner
might place another incremental sort (say above a join), or perhaps
even a join above that would actually be sufficient (since many joins
are capable of providing an ordering anyway).

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

Hmmm, I'm not sure the patch is quite correct.

Firstly, I suggest we don't remove get_useful_pathkeys_for_relation
entirely, because that allows us to add more useful pathkeys in the
future (even if we don't consider pathkeys useful for merging now).
We could also do the EC optimization in the function, return NIL and
not loop over partial_pathlist at all. That's a cosmetic issue, though.

More importantly, I'm not sure this makes sense:

/* consider incremental sort for interesting orderings */
useful_pathkeys = truncate_useless_pathkeys(root, rel, subpath->pathkeys);

...

is_sorted = pathkeys_common_contained_in(useful_pathkeys,
subpath->pathkeys,
&presorted_keys);

I mean, useful_pathkeys is bound to be a sublist of subpath->pathkeys,
right? So how could this ever return is_sorted=false?

The whole point is to end up with query_pathkeys (or whatever pathkeys
we deem useful), but this does not do that.

Yes, that's obviously a thinko in my rush to get an idea out.

I think useful_pathkeys there would be essentially
root->query_pathkeys, or, more correctly, the prefix of query_pathkeys
that has eclasses shared with the current rel. Or, if we go with the
more restrictive approach (see the two approaches I mentioned
earlier), then either NIL or query_pathkeys (if they all match
eclasses in the current rel).
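
(Roughly, a sketch of the corrected shape, with the target ordering
coming from the rel-aware helper rather than from the subpath; the
variable names are illustrative:)

/* prefix of query_pathkeys with EC members in this rel (possibly NIL) */
useful_pathkeys = get_useful_pathkeys_for_relation(root, rel);

foreach(lc, rel->partial_pathlist)
{
	Path	   *subpath = (Path *) lfirst(lc);
	bool		is_sorted;
	int			presorted_keys;

	is_sorted = pathkeys_common_contained_in(useful_pathkeys,
											 subpath->pathkeys,
											 &presorted_keys);

	if (!is_sorted && presorted_keys > 0)
	{
		/* an incremental sort on useful_pathkeys is possible here */
	}
}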

Given those requirements, I agree that keeping
get_useful_pathkeys_for_relation makes sense to wrap up all of that
behavior.

James

#246Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#244)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

Attached is my take on simplification of the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.

I've also noticed an issue in explain output. EXPLAIN ANALYZE on a simple
query gives me this:

                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=66.30..816060.48 rows=8333226 width=24) (actual time=6.464..19091.006 rows=10000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Incremental Sort  (cost=66.28..729188.13 rows=4166613 width=24) (actual time=1.836..13401.109 rows=3333333 loops=3)
         Sort Key: a, b, c
         Presorted Key: a, b
         Full-sort Groups: 4156 (Methods: quicksort) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 4137 (Methods: quicksort) Memory: 108kB (avg), 111kB (max)
         Full-sort Groups: 6888 (Methods: ) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 6849 (Methods: ) Memory: 121kB (avg), 131kB (max)
         Full-sort Groups: 6869 (Methods: ) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 6816 (Methods: ) Memory: 128kB (avg), 132kB (max)
         ->  Parallel Index Scan using t_a_b_idx on t  (cost=0.43..382353.69 rows=4166613 width=24) (actual time=0.033..9346.679 rows=3333333 loops=3)
 Planning Time: 0.133 ms
 Execution Time: 23998.669 ms
(15 rows)

while with incremental sort disabled it looks like this:

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=734387.50..831676.35 rows=8333226 width=24) (actual time=5597.978..14967.246 rows=10000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=734387.47..744804.00 rows=4166613 width=24) (actual time=5584.616..7645.711 rows=3333333 loops=3)
         Sort Key: a, b, c
         Sort Method: external merge  Disk: 111216kB
         Worker 0:  Sort Method: external merge  Disk: 109552kB
         Worker 1:  Sort Method: external merge  Disk: 112112kB
         ->  Parallel Seq Scan on t  (cost=0.00..105361.13 rows=4166613 width=24) (actual time=0.011..1753.128 rows=3333333 loops=3)
 Planning Time: 0.048 ms
 Execution Time: 19682.582 ms
(11 rows)

So I think there's a couple of issues:

1) Missing worker identification (Worker #).

2) Missing method for workers (we have it for the leader, though).

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parentheses.

4) Not sure having two lines for each worker is a great idea.

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

simplify_useful_pathkeys.txttext/plain; charset=us-asciiDownload
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 32bf734820..49d4990e66 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2748,7 +2748,6 @@ static List *
 get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 {
 	List	   *useful_pathkeys_list = NIL;
-	ListCell   *lc;
 
 	/*
 	 * Considering query_pathkeys is always worth it, because it might allow us
@@ -2756,29 +2755,27 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	 */
 	if (root->query_pathkeys)
 	{
-		bool		query_pathkeys_ok = true;
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
 
 		foreach(lc, root->query_pathkeys)
 		{
 			PathKey    *pathkey = (PathKey *) lfirst(lc);
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-			Expr	   *em_expr;
 
 			/*
-			 * We can't use incremental sort for pathkeys containing volatile
-			 * expressions. We could walk the exppression itself, but checking
-			 * ec_has_volatile here saves some cycles.
+			 * We can't build an incremental sort using pathkeys whose EC has
+			 * no member contained in the current relation, so just ignore
+			 * anything after the first such pathkey.
 			 */
-			if (pathkey_ec->ec_has_volatile ||
-				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
-			{
-				query_pathkeys_ok = false;
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
 				break;
-			}
+
+			pathkeys = lappend(pathkeys, pathkey);
 		}
 
-		if (query_pathkeys_ok)
-			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
 	}
 
 	return useful_pathkeys_list;
#247James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#246)
7 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is my take on simplification of the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.

Thanks. I've included that in the patch series in this email (as a
separate patch) with a few additional comments.

I've also noticed that the enable_incrementalsort check in
generate_useful_gather_paths seemed broken, because it returned out of
the function before creating either a plain gather merge (if already
sorted) or an explicit sort path. I've included a patch that moves it
to the if block that actually builds the incremental sort path.
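
(A sketch of the reordering, assuming a generate_useful_gather_paths
loop along the lines discussed above; is_sorted, presorted_keys,
subpath, useful_pathkeys, and rowsp come from the surrounding loop, and
the exact code in the series may differ:)

if (is_sorted)
{
	/* already fully sorted: plain gather merge, regardless of the GUC */
	GatherMergePath *path = create_gather_merge_path(root, rel, subpath,
													 rel->reltarget,
													 subpath->pathkeys,
													 NULL, rowsp);

	add_path(rel, &path->path);
	continue;
}

/* only gate the incremental sort path itself on the GUC */
if (!enable_incrementalsort || presorted_keys == 0)
	continue;

/* ... build incremental sort + gather merge on top of subpath ... */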

I've also noticed an issue in explain output. EXPLAIN ANALYZE on a simple
query gives me this:

                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=66.30..816060.48 rows=8333226 width=24) (actual time=6.464..19091.006 rows=10000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Incremental Sort  (cost=66.28..729188.13 rows=4166613 width=24) (actual time=1.836..13401.109 rows=3333333 loops=3)
         Sort Key: a, b, c
         Presorted Key: a, b
         Full-sort Groups: 4156 (Methods: quicksort) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 4137 (Methods: quicksort) Memory: 108kB (avg), 111kB (max)
         Full-sort Groups: 6888 (Methods: ) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 6849 (Methods: ) Memory: 121kB (avg), 131kB (max)
         Full-sort Groups: 6869 (Methods: ) Memory: 30kB (avg), 30kB (max)
         Presorted Groups: 6816 (Methods: ) Memory: 128kB (avg), 132kB (max)
         ->  Parallel Index Scan using t_a_b_idx on t  (cost=0.43..382353.69 rows=4166613 width=24) (actual time=0.033..9346.679 rows=3333333 loops=3)
 Planning Time: 0.133 ms
 Execution Time: 23998.669 ms
(15 rows)

while with incremental sort disabled it looks like this:

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=734387.50..831676.35 rows=8333226 width=24) (actual time=5597.978..14967.246 rows=10000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=734387.47..744804.00 rows=4166613 width=24) (actual time=5584.616..7645.711 rows=3333333 loops=3)
         Sort Key: a, b, c
         Sort Method: external merge  Disk: 111216kB
         Worker 0:  Sort Method: external merge  Disk: 109552kB
         Worker 1:  Sort Method: external merge  Disk: 112112kB
         ->  Parallel Seq Scan on t  (cost=0.00..105361.13 rows=4166613 width=24) (actual time=0.011..1753.128 rows=3333333 loops=3)
 Planning Time: 0.048 ms
 Execution Time: 19682.582 ms
(11 rows)

So I think there's a couple of issues:

1) Missing worker identification (Worker #).

Fixed.

2) Missing method for workers (we have it for the leader, though).

Fixed. Since we can't have pointers in the parallel shared memory
space, we can't store the sort methods used in a list. To accomplish
the same goal, I've assigned the TuplesortMethod enum entries unique
bit positions, and store the methods used in a bitmask.
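
(The bit-flag trick, boiled down to a standalone example; the enum
values are illustrative rather than necessarily the exact ones in the
patch:)

#include <stdio.h>

typedef enum
{
	SORT_TYPE_STILL_IN_PROGRESS = 0,
	SORT_TYPE_TOP_N_HEAPSORT = 1 << 0,
	SORT_TYPE_QUICKSORT = 1 << 1,
	SORT_TYPE_EXTERNAL_SORT = 1 << 2,
	SORT_TYPE_EXTERNAL_MERGE = 1 << 3
} TuplesortMethod;

int
main(void)
{
	/* a plain int works as a set, so it fits in fixed-size shared memory */
	int			methods = 0;

	/* two batches that happened to use different methods */
	methods |= SORT_TYPE_QUICKSORT;
	methods |= SORT_TYPE_TOP_N_HEAPSORT;

	if (methods & SORT_TYPE_QUICKSORT)
		puts("quicksort");
	if (methods & SORT_TYPE_TOP_N_HEAPSORT)
		puts("top-N heapsort");
	return 0;
}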

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parentheses.

I've removed the parentheses. It's labeled "Methods" since there can
be more than one (different batches could use different methods). I've
updated this to properly use singular/plural depending on the number
of methods used.

4) Not sure having two lines for each worker is a great idea.

I've left these in for now because the lines are already very long
(much, much longer than the worker lines in a standard sort node).
This is largely because we're trying to summarize many sort batches,
while standard sort nodes only have to give the exact stats from a
single batch.

See the example output later in the email.

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".

Updated.

Here's the current output:

 Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual time=13218.403..13222.519 rows=10000 loops=1)
   ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360 width=8) (actual time=13218.401..13229.750 rows=10000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Incremental Sort  (cost=1886419.17..10005010.55 rows=41666680 width=8) (actual time=13208.004..13208.586 rows=4425 loops=3)
               Sort Key: a, b
               Presorted Key: a
               Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
               Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1681kB peak=1681kB
               Worker 0:  Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
                 Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1680kB peak=1680kB
               Worker 1:  Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
                 Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1682kB peak=1682kB
               ->  Parallel Index Scan using index_s_a on s  (cost=0.57..4967182.06 rows=41666680 width=8) (actual time=0.455..11730.878 rows=6666668 loops=3)

James

Attachments:

v44-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v44-0002-Implement-incremental-sort.patchDownload
From 0ae6e4c64d16c579cd32557fafd5869f0334ada9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v44 2/7] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   22 +
 src/backend/commands/explain.c                |  223 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1267 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   63 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  307 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1400 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 40 files changed, 4144 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..47ceea43d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,28 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m &lt; n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..39d51848b6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,180 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..9fe93d5979
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and minimizing the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the tuples remaining if this sort is bounded, and
+		 * configure both the bounded sort and the minimum group size
+		 * accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
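+
+		/*
+		 * As an illustration (not an exact trace): with a LIMIT of 10 and no
+		 * tuples returned so far, currentBound above would be 10, so the full
+		 * sort is bounded to 10 tuples and minGroupSize is also 10.
+		 */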
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sort "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * must move tuples from the full sort to the presorted prefix
+				 * tuplesort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring the sort's bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
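+
+/*
+ * A rough sketch of the execution status transitions above (illustrative,
+ * not exhaustive):
+ *
+ *   LOADFULLSORT   --(group boundary or end of input)--> READFULLSORT
+ *   LOADFULLSORT   --(group exceeds the max full sort group size)-->
+ *                    mode transition, which sets LOADPREFIXSORT or
+ *                    READPREFIXSORT
+ *   LOADPREFIXSORT --(group boundary or end of input)--> READPREFIXSORT
+ *   READ* states   --(batch drained)--> LOADFULLSORT, or the mode
+ *                    transition function again if n_fullsort_remaining > 0
+ */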
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because the current sort state
+	 * holds only one of many sort batches at any given time.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them. We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because the pivot comparator state
+	 * setup is guarded similarly, doing so might actually cause a leak.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and increase its average
+	 * pessimistic about incremental sort performance and increase its average
+	 * group size by 50%.
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
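+	/*
+	 * Illustrative numbers only: 1M input tuples estimated to fall into
+	 * 1,000 groups of equal presorted keys gives group_tuples = 1000, so the
+	 * per-group sort above is costed as if each group held 1500 tuples.
+	 */
+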
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
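+ *
+ *    For example (an illustrative sketch): with keys1 = (a, b, c) and
+ *    keys2 = (a, b), *n_common is set to 2 and false is returned, while
+ *    with keys1 = (a, b) and keys2 = (a, b, c), *n_common is 2 and true
+ *    is returned.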
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns the length of the longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1831,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix list of
+ * keys is potentially useful for improving the performance of the requested
+ * ordering.  Thus we return 0 if no useful keys are found, or else the
+ * number of leading keys shared by the list and the requested ordering.
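+ *
+ * For example (illustrative): with query_pathkeys (a, b, c) and a path
+ * sorted by (a), we return 1, since an incremental sort could reuse that
+ * leading key.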
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
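+
+		/*
+		 * For example (illustrative): with sort_pathkeys (a, b) and an input
+		 * path sorted only by (a), is_sorted is false and presorted_keys is
+		 * 1, making the path a candidate for incremental sort below.
+		 */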
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..bc2c2dbb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the array
+ * doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the allocation
+ * overhead might be lowered.  However, we don't consider array sizes less
+ * than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sort groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace refers to on-disk
+								 * space, false when it refers to in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context that survives tuplesort_reset.  It holds data that is
+	 * worth keeping while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are freed by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of
+	 * the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent uses).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1295,25 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine to free the resources of a tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1371,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Reset the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * A sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, due to a
+	 * more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in place.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This allows us to avoid recreating tuplesort states
+ *	(and so save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2765,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2815,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3312,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
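
To make the intended use of the new batch API concrete, here is a minimal
usage sketch (illustrative only; fetch_next_tuple_of_group(), emit() and
more_groups are placeholders, and the tuplesort_begin_heap() arguments are
assumed to be set up as for any ordinary heap sort):

    Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                              sortOperators, collations,
                                              nullsFirstFlags, work_mem,
                                              NULL, false);

    while (more_groups)
    {
        while (fetch_next_tuple_of_group(slot))
            tuplesort_puttupleslot(ts, slot);

        tuplesort_performsort(ts);

        while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
            emit(slot);

        /* Free per-batch memory, keep the metadata, start the next group. */
        tuplesort_reset(ts);
    }

    tuplesort_end(ts);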
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
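
The executor functions declared here drive a small state machine (see the
IncrementalSortExecutionStatus enum added to execnodes.h below).  Roughly,
the intended flow looks like this (a simplified sketch, not the actual
executor code):

    switch (node->execution_status)
    {
        case INCSORT_LOADFULLSORT:
            /*
             * Accumulate tuples into fullsort_state, sorting by all keys.
             * Small groups are batched together to amortize sort startup
             * costs.
             */
            break;
        case INCSORT_READFULLSORT:
            /* Return sorted tuples from fullsort_state. */
            break;
        case INCSORT_LOADPREFIXSORT:
            /*
             * The presorted keys are constant within a group, so only the
             * remaining suffix keys need to be sorted in prefixsort_state.
             */
            break;
        case INCSORT_READPREFIXSORT:
            /* Return sorted tuples from prefixsort_state. */
            break;
    }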
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on a prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
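
PresortedKeyData is what lets the executor detect group boundaries cheaply:
each incoming tuple is compared against the group_pivot slot on the presorted
columns only.  A sketch of that check (a simplification; the real code must
also handle NULLs properly, and presorted_key_count stands in for the plan's
presortedCols):

    static bool
    same_prefix_group(IncrementalSortState *node, TupleTableSlot *tuple)
    {
        int		i;

        for (i = 0; i < presorted_key_count; i++)
        {
            PresortedKeyData *key = &node->presorted_keys[i];
            Datum	a,
                    b;
            bool	a_null,
                    b_null;

            a = slot_getattr(node->group_pivot, key->attno, &a_null);
            b = slot_getattr(tuple, key->attno, &b_null);
            if (a_null || b_null)
                return a_null && b_null;	/* simplification */

            /* Invoke the column's equality function. */
            if (!DatumGetBool(FunctionCall2(&key->flinfo, a, b)))
                return false;
        }
        return true;
    }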
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
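
The intuition behind cost_incremental_sort() is to cost one sort of an
average-sized group and then scale by the number of groups.  A rough sketch
of the approach (an illustration only, not the exact formula; num_groups is
assumed to come from estimate_num_groups() over the presorted keys):

    double	group_tuples = input_tuples / num_groups;
    Cost	group_input_cost = (input_total_cost - input_startup_cost) /
                               num_groups;
    Cost	group_startup_cost,
            group_run_cost;

    cost_full_sort(&group_startup_cost, &group_run_cost,
                   group_input_cost, group_tuples, width,
                   comparison_cost, sort_mem, limit_tuples);

    /* The first group must be fully sorted before returning any tuple. */
    startup_cost = input_startup_cost + group_startup_cost;
    total_cost = startup_cost + group_run_cost +
                 (num_groups - 1) * (group_startup_cost + group_run_cost);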
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..ebb8412237
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1400 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                  
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "quicksort",                    +
+                 "top-N heapsort"                +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                         explain_analyze_without_memory                          
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v44-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v44-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From e4a0edb72e456e2aea6dcfa69d33a58302f2b22a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v44 1/7] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher-cost plan ends up being
chosen because a low startup cost partial path is discarded in favor of
a lower total cost partial path, and a limit applied on top of that
would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may, however, matter how well
+ *	  the path ordering matches the final ordering needed by upper parts
+ *	  of the plan.  Because that affects how expensive the incremental
+ *	  sort is, we need to consider both startup and total cost, in
+ *	  addition to pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v44-0005-rework-of-get_useful_pathkeys_for_relation.patchtext/x-patch; charset=US-ASCII; name=v44-0005-rework-of-get_useful_pathkeys_for_relation.patchDownload
From cf47f29fa4c254bf14d7107e75dfd432350dbfcf Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 20:03:27 -0400
Subject: [PATCH v44 5/7] rework of get_useful_pathkeys_for_relation

---
 src/backend/optimizer/path/allpaths.c | 32 +++++++++++++++------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 32bf734820..480803fb7a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2742,13 +2742,13 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
  * XXX At the moment this can only ever return a list with a single element,
  * because it looks at query_pathkeys only. So we might return the pathkeys
  * directly, but it seems plausible we'll want to consider other orderings
- * in the future.
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
  */
 static List *
 get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 {
 	List	   *useful_pathkeys_list = NIL;
-	ListCell   *lc;
 
 	/*
 	 * Considering query_pathkeys is always worth it, because it might allow us
@@ -2756,29 +2756,33 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	 */
 	if (root->query_pathkeys)
 	{
-		bool		query_pathkeys_ok = true;
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
 
 		foreach(lc, root->query_pathkeys)
 		{
 			PathKey    *pathkey = (PathKey *) lfirst(lc);
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-			Expr	   *em_expr;
 
 			/*
-			 * We can't use incremental sort for pathkeys containing volatile
-			 * expressions. We could walk the expression itself, but checking
-			 * ec_has_volatile here saves some cycles.
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in
+			 * the relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that meets
+			 * the criteria of EC membership in the current relation, we
+			 * enable not just an incremental sort on the entirety of
+			 * query_pathkeys but also incremental sort below a JOIN.
 			 */
-			if (pathkey_ec->ec_has_volatile ||
-				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
-			{
-				query_pathkeys_ok = false;
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
 				break;
-			}
+
+			pathkeys = lappend(pathkeys, pathkey);
 		}
 
-		if (query_pathkeys_ok)
-			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
 	}
 
 	return useful_pathkeys_list;
-- 
2.17.1

v44-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v44-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 45b95f4631d808ed74811d32fb04c0401515cd8a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v44 3/7] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 ----
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++++
 src/backend/optimizer/plan/planner.c    | 130 ++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 5 files changed, 366 insertions(+), 32 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..32bf734820 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the expression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, not point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v44-0004-A-couple-more-places-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v44-0004-A-couple-more-places-for-incremental-sort.patchDownload
From 6ffe301df13f067893a81b060fcc6fab950e48b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v44 4/7] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 215 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6511,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups = path->rows * path->parallel_workers;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

v44-0006-fix-inc-sort-enabled-check.patchtext/x-patch; charset=US-ASCII; name=v44-0006-fix-inc-sort-enabled-check.patchDownload
From 27d1e6ebd84515465c2fe70f664048124fcf3681 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 20:04:03 -0400
Subject: [PATCH v44 6/7] fix inc sort enabled check

---
 src/backend/optimizer/path/allpaths.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 480803fb7a..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2818,10 +2818,6 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	/* generate the regular gather (merge) paths */
 	generate_gather_paths(root, rel, override_rows);
 
-	/* when incremental sort is disabled, we're done */
-	if (!enable_incrementalsort)
-		return;
-
 	/* consider incremental sort for interesting orderings */
 	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
 
@@ -2911,7 +2907,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			 * Consider incremental sort, but only when the subpath is already
 			 * partially sorted on a pathkey prefix.
 			 */
-			if (presorted_keys > 0)
+			if (enable_incrementalsort && presorted_keys > 0)
 			{
 				Path	   *tmp;
 
-- 
2.17.1

v44-0007-explain-fixes.patchtext/x-patch; charset=US-ASCII; name=v44-0007-explain-fixes.patchDownload
From 71da5c3d406efe9a11dd9cf73646d7dc7e510122 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 22:35:49 -0400
Subject: [PATCH v44 7/7] explain fixes

---
 src/backend/commands/explain.c             | 70 ++++++++++++----------
 src/backend/executor/nodeIncrementalSort.c | 34 ++++++-----
 src/include/nodes/execnodes.h              |  2 +-
 src/include/utils/tuplesort.h              | 11 ++--
 4 files changed, 65 insertions(+), 52 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 39d51848b6..24acde506e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2713,26 +2713,41 @@ show_sort_info(SortState *sortstate, ExplainState *es)
  */
 static void
 show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
-								 const char *groupLabel, ExplainState *es)
+								 const char *groupLabel, bool indent, ExplainState *es)
 {
 	ListCell   *methodCell;
-	int			methodCount = list_length(groupInfo->sortMethods);
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(Size) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		appendStringInfoSpaces(es->str, es->indent * 2);
-		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
 						 groupInfo->groupCount);
-		foreach(methodCell, groupInfo->sortMethods)
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
 		{
-			const char *sortMethodName;
-
-			sortMethodName = tuplesort_method_name(methodCell->int_value);
-			appendStringInfo(es->str, "%s", sortMethodName);
-			if (foreach_current_index(methodCell) < methodCount - 1)
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
 				appendStringInfo(es->str, ", ");
 		}
-		appendStringInfo(es->str, ")");
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
@@ -2740,7 +2755,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxMemorySpaceUsed);
 		}
@@ -2755,7 +2770,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			/* Add a semicolon separator only if memory stats were printed. */
 			if (groupInfo->maxMemorySpaceUsed > 0)
 				appendStringInfo(es->str, ";");
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxDiskSpaceUsed);
 		}
@@ -2764,7 +2779,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	}
 	else
 	{
-		List	   *methodNames = NIL;
 		StringInfoData groupName;
 
 		initStringInfo(&groupName);
@@ -2772,12 +2786,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
 		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
 
-		foreach(methodCell, groupInfo->sortMethods)
-		{
-			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
-
-			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
-		}
 		ExplainPropertyList("Sort Methods Used", methodNames, es);
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
@@ -2834,10 +2842,10 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
 		return;
 
-	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
 	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
 	if (prefixsortGroupInfo->groupCount > 0)
-		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
@@ -2860,20 +2868,18 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 				prefixsortGroupInfo->groupCount == 0)
 				continue;
 
-			if (!opened_group)
-			{
-				ExplainOpenGroup("Workers", "Workers", false, es);
-				opened_group = true;
-			}
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
 
 			if (fullsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+												 es->workers_state == NULL, es);
 			if (prefixsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
-		}
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
-		if (opened_group)
-			ExplainCloseGroup("Workers", "Workers", false, es);
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
 	}
 }
 
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 9fe93d5979..004165ccc1 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -100,6 +100,22 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
 	TuplesortInstrumentation sort_instr;
 
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		/*
+		 * XXX This is a bit gross, but avoiding it would take inventing
+		 * some new enum or flag argument to this method...
+		 */
+		if (groupInfo == &node->incsort_info.fullsortGroupInfo)
+			groupInfo = &node->shared_info->sinfo[ParallelWorkerNumber].fullsortGroupInfo;
+		else if (groupInfo == &node->incsort_info.prefixsortGroupInfo)
+			groupInfo = &node->shared_info->sinfo[ParallelWorkerNumber].prefixsortGroupInfo;
+	}
+
 	groupInfo->groupCount++;
 
 	tuplesort_get_stats(sortState, &sort_instr);
@@ -122,19 +138,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	}
 
 	/* Track each sort method we've used. */
-	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
-		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
-											 sort_instr.sortMethod);
-
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
-	{
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
-
-		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
-			   &node->incsort_info, sizeof(IncrementalSortInfo));
-	}
+	groupInfo->sortMethods |= sort_instr.sortMethod;
 }
 
 /* ----------------------------------------------------------------
@@ -1026,13 +1030,13 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 		fullsortGroupInfo->totalDiskSpaceUsed = 0;
 		fullsortGroupInfo->maxMemorySpaceUsed = 0;
 		fullsortGroupInfo->totalMemorySpaceUsed = 0;
-		fullsortGroupInfo->sortMethods = NIL;
+		fullsortGroupInfo->sortMethods = 0;
 		prefixsortGroupInfo->groupCount = 0;
 		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
 		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
 		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
 		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
-		prefixsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->sortMethods = 0;
 	}
 
 	/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6127ab5912..8d1b944472 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2034,7 +2034,7 @@ typedef struct IncrementalSortGroupInfo
 	long		totalDiskSpaceUsed;
 	long		maxMemorySpaceUsed;
 	long		totalMemorySpaceUsed;
-	List	   *sortMethods;
+	Size		sortMethods; /* bitmask of TuplesortMethod */
 } IncrementalSortGroupInfo;
 
 typedef struct IncrementalSortInfo
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0e9ab4e586..96e970339c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used in a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
 	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
 } TuplesortMethod;
 
 typedef enum
-- 
2.17.1
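
To make the bit-flag scheme in the tuplesort.h hunk concrete, here is a
minimal standalone C sketch of the bookkeeping (the enum values come from
the diff above; sortMethods is a simplified stand-in for the field in
IncrementalSortGroupInfo, and everything else is a toy):

/*
 * Minimal sketch of the bit-flag bookkeeping from the hunk above.
 * TuplesortMethod values are taken from the diff; the surrounding
 * state is a simplified stand-in, not the actual executor code.
 */
#include <stdio.h>

typedef enum
{
	SORT_TYPE_STILL_IN_PROGRESS = 0,
	SORT_TYPE_TOP_N_HEAPSORT = 2,
	SORT_TYPE_QUICKSORT = 4,
	SORT_TYPE_EXTERNAL_SORT = 8,
	SORT_TYPE_EXTERNAL_MERGE = 16
} TuplesortMethod;

int
main(void)
{
	size_t		sortMethods = 0;	/* plays the role of groupInfo->sortMethods */

	/* Each finished batch ORs in whatever method tuplesort reports for it. */
	sortMethods |= SORT_TYPE_QUICKSORT;
	sortMethods |= SORT_TYPE_TOP_N_HEAPSORT;
	sortMethods |= SORT_TYPE_QUICKSORT;		/* repeats are idempotent */

	/* EXPLAIN can later test individual bits instead of walking a List. */
	if (sortMethods & SORT_TYPE_TOP_N_HEAPSORT)
		printf("some batch used top-N heapsort\n");
	if (sortMethods & SORT_TYPE_EXTERNAL_MERGE)
		printf("some batch spilled to an external merge\n");

	return 0;
}

Because the field is a plain integer rather than a List, the group info
struct stays pointer-free, which is what allows instrumentSortedGroup to
write it straight into the worker's slot in shared memory.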

#248Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#247)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:

On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is my take on simplification of the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.

Thanks. I've included that in the patch series in this email (as a
separate patch) with a few additional comments.

Thanks.

I've also noticed that the enable_incrementalsort check in
generate_useful_gather_paths seemed broken, because it returned us out
of the function before creating either a plain gather merge (if
already sorted) or an explicit sort path. I've included a patch that
moves it to the if block that actually builds the incremental sort
path.

Hmmm, that's probably right.

The reason why the GUC check was right after generate_gather_paths is
that the intent was to disable all the useful-pathkeys business, essentially
reducing it back to plain generate_gather_paths.

But I think you're right that's wrong, because it might lead to strange
behavior when the GUC switches between plans without any incremental
sort nodes - setting it to 'true' might end up picking GM on top of
plain Sort, for example.

...

1) Missing worker identification (Worker #).

Fixed.

2) Missing method for workers (we have it for the leader, though).

Fixed. Since we can't have pointers in the parallel shared memory
space, we can't store the sort methods used in a list. To accomplish
the same goal, I've assigned the TuplesortMethod enum entries unique
bit positions, and store the methods used in a bitmask.

OK, makes sense.

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parenthesis.

I've removed the parentheses. It's labeled "Methods" since there can
be more than one (different batches could use different methods). I've
updated this to properly use singular/plural depending on the number
of methods used.

I'm a bit confused. How do I know which batch used which method? Is it
actually worth printing in explain analyze? Maybe only print it in the
verbose mode?

4) Not sure having two lines for each worker is a great idea.

I've left these in for now because the lines are already very long
(much, much longer than the worker lines in a standard sort node).
This is largely because we're trying to summarize many sort batches,
while standard sort nodes only have to give the exact stats from a
single batch.

See the example output later in the email.

OK

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".

Updated.

OK

Here's the current output:

Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual time=13218.403..13222.519 rows=10000 loops=1)
  ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360 width=8) (actual time=13218.401..13229.750 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Incremental Sort  (cost=1886419.17..10005010.55 rows=41666680 width=8) (actual time=13208.004..13208.586 rows=4425 loops=3)
              Sort Key: a, b
              Presorted Key: a
              Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
              Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1681kB peak=1681kB
              Worker 0:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1680kB peak=1680kB
              Worker 1:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1682kB peak=1682kB
              ->  Parallel Index Scan using index_s_a on s  (cost=0.57..4967182.06 rows=41666680 width=8) (actual time=0.455..11730.878 rows=6666668 loops=3)

James

Looks reasonable. Did you try it in other output formats - json/yaml?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#249James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#248)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 11:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:

On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is my take on simplification of the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.

Thanks. I've included that in the patch series in this email (as a
separate patch) with a few additional comments.

Thanks.

I've also noticed that the enable_incrementalsort check in
generate_useful_gather_paths seemed broken, because it returned us out
of the function before creating either a plain gather merge (if
already sorted) or an explicit sort path. I've included a patch that
moves it to the if block that actually builds the incremental sort
path.

Hmmm, that's probably right.

The reason why the GUC check was right after generate_gather_paths is
that the intent was to disable all the useful-pathkeys business, essentially
reducing it back to plain generate_gather_paths.

But I think you're right that's wrong, because it might lead to strange
behavior when the GUC switches between plans without any incremental
sort nodes - setting it to 'true' might end up picking GM on top of
plain Sort, for example.

Thanks.

...

1) Missing worker identification (Worker #).

Fixed.

2) Missing method for workers (we have it for the leader, though).

Fixed. Since we can't have pointers in the parallel shared memory
space, we can't store the sort methods used in a list. To accomplish
the same goal, I've assigned the TuplesortMethod enum entries unique
bit positions, and store the methods used in a bitmask.

OK, makes sense.

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parenthesis.

I've removed the parentheses. It's labeled "Methods" since there can
be more than one (different batches could use different methods). I've
updated this to properly use singular/plural depending on the number
of methods used.

I'm a bit confused. How do I know which batch used which method? Is it
actually worth printing in explain analyze? Maybe only print it in the
verbose mode?

The alternative is showing no sort method information at all, or only
showing it if all batches used the same method (which seems confusing
to me). It seems weird that we wouldn't try to find some rough
analogue to what a regular sort node outputs, so I've attempted to
summarize.

This is similar to the memory information: the average doesn't apply
to any one batch, and you don't know which one (or how many) hit the
peak memory usage either, but I think it's meaningful to know a
summary.
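
To make that concrete (made-up numbers): three presorted batches peaking
at 26kB, 28kB, and 30kB would be reported as "Memory: avg=28kB peak=30kB",
i.e., 84kB total divided by 3 groups, plus the running maximum. Neither
figure describes one specific batch, but together they bound the
per-batch behavior.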

With the sort methods, I think it's useful to be able to, for example,
know if any of the groups happened to trigger the top-n heapsort
optimization, or not, and as a corollary, if all of them did or not.

4) Not sure having two lines for each worker is a great idea.

I've left these in for now because the lines are already very long
(much, much longer than the worker lines in a standard sort node).
This is largely because we're trying to summarize many sort batches,
while standard sort nodes only have to give the exact stats from a
single batch.

See the example output later in the email.

OK

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".

Updated.

OK

Here's the current output:

Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual time=13218.403..13222.519 rows=10000 loops=1)
  ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360 width=8) (actual time=13218.401..13229.750 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Incremental Sort  (cost=1886419.17..10005010.55 rows=41666680 width=8) (actual time=13208.004..13208.586 rows=4425 loops=3)
              Sort Key: a, b
              Presorted Key: a
              Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
              Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1681kB peak=1681kB
              Worker 0:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1680kB peak=1680kB
              Worker 1:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1682kB peak=1682kB
              ->  Parallel Index Scan using index_s_a on s  (cost=0.57..4967182.06 rows=41666680 width=8) (actual time=0.455..11730.878 rows=6666668 loops=3)

Looks reasonable. Did you try it in other output formats - json/yaml?

I did. JSON looks good also, which implies to me yaml would too (but I
didn't look at it).

James

#250James Coleman
jtc331@gmail.com
In reply to: James Coleman (#249)
7 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Mar 28, 2020 at 11:23 PM James Coleman <jtc331@gmail.com> wrote:

On Sat, Mar 28, 2020 at 11:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:

On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is my take on simplification of the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.

Thanks. I've included that in the patch series in this email (as a
separate patch) with a few additional comments.

Thanks.

I've also noticed that the enable_incrementalsort check in
generate_useful_gather_paths seemed broken, because it returned us out
of the function before creating either a plain gather merge (if
already sorted) or an explicit sort path. I've included a patch that
moves it to the if block that actually builds the incremental sort
path.

Hmmm, that's probably right.

The reason why the GUC check was right after generate_gather_paths is
that the intent was to disable all the useful-pathkeys business, essentially
reducing it back to plain generate_gather_paths.

But I think you're right that's wrong, because it might lead to strange
behavior when the GUC switches between plans without any incremental
sort nodes - setting it to 'true' might end up picking GM on top of
plain Sort, for example.

Thanks.

...

1) Missing worker identification (Worker #).

Fixed.

2) Missing method for workers (we have it for the leader, though).

Fixed. Since we can't have pointers in the parallel shared memory
space, we can't store the sort methods used in a list. To accomplish
the same goal, I've assigned the TuplesortMethod enum entries unique
bit positions, and store the methods used in a bitmask.

OK, makes sense.

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parenthesis.

I've removed the parentheses. It's labeled "Methods" since there can
be more than one (different batches could use different methods). I've
updated this to properly use singular/plural depending on the number
of methods used.

I'm a bit confused. How do I know which batch used which method? Is it
actually worth printing in explain analyze? Maybe only print it in the
verbose mode?

The alternative is showing no sort method information at all, or only
showing it if all batches used the same method (which seems confusing
to me). It seems weird that we wouldn't try to find some rough
analogue to what a regular sort node outputs, so I've attempted to
summarize.

This is similar to the memory information: the average doesn't apply
to any one batch, and you don't know which one (or how many) hit the
peak memory usage either, but I think it's meaningful to know a
summary.

With the sort methods, I think it's useful to be able to, for example,
know if any of the groups happened to trigger the top-n heapsort
optimization, or not, and as a corollary, if all of them did or not.

4) Not sure having two lines for each worker is a great idea.

I've left these in for now because the lines are already very long
(much, much longer than the worker lines in a standard sort node).
This is largely because we're trying to summarize many sort batches,
while standard sort nodes only have to give the exact stats from a
single batch.

See the example output later in the email.

OK

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".

Updated.

OK

Here's the current output:

Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual time=13218.403..13222.519 rows=10000 loops=1)
  ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360 width=8) (actual time=13218.401..13229.750 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Incremental Sort  (cost=1886419.17..10005010.55 rows=41666680 width=8) (actual time=13208.004..13208.586 rows=4425 loops=3)
              Sort Key: a, b
              Presorted Key: a
              Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
              Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1681kB peak=1681kB
              Worker 0:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1680kB peak=1680kB
              Worker 1:  Full-sort Groups: 1  Sort Method: quicksort  Memory: avg=28kB peak=28kB
                Presorted Groups: 1  Sort Method: top-N heapsort  Memory: avg=1682kB peak=1682kB
              ->  Parallel Index Scan using index_s_a on s  (cost=0.57..4967182.06 rows=41666680 width=8) (actual time=0.455..11730.878 rows=6666668 loops=3)

Looks reasonable. Did you try it in other output formats - json/yaml?

I did. JSON looks good also, which implies to me yaml would too (but I
didn't look at it).

After sleeping on it, I decided the XXX I'd left in my explain fixes
was too gross to keep around, so I've replaced it with a macro that
selects the proper shared or private memory group info struct (that
way we avoid the pointer comparison hack to reverse engineer, in the
parallel worker case, which group info we were looking to use).
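
A toy model of the idea, using token pasting to select the destination
struct once at the call site (all names here are stand-ins, not the
patch's actual code):

/*
 * Hypothetical, simplified sketch of the macro described above: token
 * pasting selects either the worker's shared slot or the node's private
 * struct, so nothing downstream has to compare pointers to decide.
 */
#include <stdio.h>

typedef struct GroupInfo { long groupCount; } GroupInfo;

typedef struct SortNode
{
	int			am_worker;						/* stand-in for node->am_worker */
	GroupInfo  *shared_fullsortGroupInfo;		/* worker slot in "shared memory" */
	GroupInfo	fullsortGroupInfo;				/* private (leader) copy */
} SortNode;

#define SORT_GROUP_INFO(node, groupName) \
	(((node)->shared_##groupName##GroupInfo && (node)->am_worker) ? \
	 (node)->shared_##groupName##GroupInfo : &(node)->groupName##GroupInfo)

int
main(void)
{
	GroupInfo	sharedSlot = {0};
	SortNode	leader = {0, NULL, {0}};
	SortNode	worker = {1, &sharedSlot, {0}};

	SORT_GROUP_INFO(&leader, fullsort)->groupCount++;	/* bumps the private copy */
	SORT_GROUP_INFO(&worker, fullsort)->groupCount++;	/* bumps the shared slot */

	printf("leader private: %ld, worker shared: %ld\n",
		   leader.fullsortGroupInfo.groupCount, sharedSlot.groupCount);
	return 0;
}

In the actual patch the two branches presumably dispatch into
instrumentSortedGroup with either
&node->shared_info->sinfo[ParallelWorkerNumber].<group>GroupInfo or
&node->incsort_info.<group>GroupInfo, per the structures in the diffs
above.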

James

Attachments:

v45-0004-A-couple-more-places-for-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v45-0004-A-couple-more-places-for-incremental-sort.patchDownload
From 6ffe301df13f067893a81b060fcc6fab950e48b4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 16:03:56 +0200
Subject: [PATCH v45 4/7] A couple more places for incremental sort

---
 src/backend/optimizer/geqo/geqo_eval.c |   2 +-
 src/backend/optimizer/plan/planner.c   | 218 ++++++++++++++++++++++++-
 2 files changed, 215 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35e770f241..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This probably duplicates the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6511,7 +6572,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* We've already skipped fully sorted paths above. */
 			Assert(!is_sorted);
 
-			/* no shared prefix, not point in building incremental sort */
+			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
@@ -6577,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6613,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6884,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -7076,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -7105,6 +7275,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+		total_groups = path->rows * path->parallel_workers;
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7206,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
-- 
2.17.1

v45-0005-rework-of-get_useful_pathkeys_for_relation.patchtext/x-patch; charset=US-ASCII; name=v45-0005-rework-of-get_useful_pathkeys_for_relation.patchDownload
From cf47f29fa4c254bf14d7107e75dfd432350dbfcf Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 20:03:27 -0400
Subject: [PATCH v45 5/7] rework of get_useful_pathkeys_for_relation

---
 src/backend/optimizer/path/allpaths.c | 32 +++++++++++++++------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 32bf734820..480803fb7a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2742,13 +2742,13 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
  * XXX At the moment this can only ever return a list with a single element,
  * because it looks at query_pathkeys only. So we might return the pathkeys
  * directly, but it seems plausible we'll want to consider other orderings
- * in the future.
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
  */
 static List *
 get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 {
 	List	   *useful_pathkeys_list = NIL;
-	ListCell   *lc;
 
 	/*
 	 * Considering query_pathkeys is always worth it, because it might allow us
@@ -2756,29 +2756,33 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	 */
 	if (root->query_pathkeys)
 	{
-		bool		query_pathkeys_ok = true;
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
 
 		foreach(lc, root->query_pathkeys)
 		{
 			PathKey    *pathkey = (PathKey *) lfirst(lc);
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-			Expr	   *em_expr;
 
 			/*
-			 * We can't use incremental sort for pathkeys containing volatile
-			 * expressions. We could walk the exppression itself, but checking
-			 * ec_has_volatile here saves some cycles.
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member the
+			 * list as soon as we find a pathkey without an EC member in the
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * the criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
 			 */
-			if (pathkey_ec->ec_has_volatile ||
-				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
-			{
-				query_pathkeys_ok = false;
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
 				break;
-			}
+
+			pathkeys = lappend(pathkeys, pathkey);
 		}
 
-		if (query_pathkeys_ok)
-			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
 	}
 
 	return useful_pathkeys_list;
-- 
2.17.1
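
To see what the truncation buys (hypothetical example): for a query like
SELECT ... FROM t1 JOIN t2 USING (x) ORDER BY t1.a, t2.b, the relation t1
has an EC member only for the first pathkey, so the old code gave up on
query_pathkeys entirely; the reworked function instead returns the prefix
(t1.a), which lets the planner build a Gather Merge ordered by t1.a
beneath the join and finish with an Incremental Sort on (t1.a, t2.b)
above it.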

v45-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v45-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From e4a0edb72e456e2aea6dcfa69d33a58302f2b22a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v45 1/7] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: as a result, a higher cost plan
ends up being chosen, because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit applied on
top of that would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the startup and total cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1
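
To make the scenario in the commit message concrete (made-up costs):
under a LIMIT fetching 1% of the rows, a partial path with
startup_cost=1 and total_cost=100000 pays roughly
1 + 0.01 * (100000 - 1) ~= 1001, while a rival with startup_cost=90000
and total_cost=95000 pays at least 90000 before producing its first
tuple; comparing total cost alone (95000 < 100000) would have discarded
the path that is far cheaper in practice.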

v45-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v45-0002-Implement-incremental-sort.patchDownload
From 0ae6e4c64d16c579cd32557fafd5869f0334ada9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v45 2/7] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   22 +
 src/backend/commands/explain.c                |  223 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1267 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   63 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  307 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1400 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 40 files changed, 4144 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..47ceea43d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,28 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m < n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..39d51848b6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,180 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..9fe93d5979
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them all together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting
+ *	    solely on the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
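 *	As a hypothetical illustration (the table and column names here are
 *	invented): given an index on (x) and the query
 *
 *		SELECT * FROM tbl ORDER BY x, y LIMIT 100;
 *
 *	the planner can read rows already ordered by x from the index and use
 *	an incremental sort to additionally order each x-group by y, returning
 *	the first 100 rows without sorting the entire table.
 *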
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
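 *
 * For example (an illustrative case): with two presorted columns, comparing
 * pivot (1, 2, 9) against tuple (1, 2, 5) invokes the equality function for
 * the second column first and then the first; both match, so the tuple
 * belongs to the current group regardless of the third (unsorted) column.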
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * The input is sorted by keys (0, ... n), which implies that the tail
+	 * keys are more likely to change.  Therefore we do our comparison
+	 * starting from the last presorted column to optimize for early
+	 * detection of inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all of the already-fetched tuples are part of a
+ * single prefix key group, we also have to handle the possibility that
+ * there is at least one different prefix key group before the large one.
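 *
 * For example (an illustrative run): if the full sort state holds tuples
 * with prefix keys (1), (1), (2), (2), the first call transfers and sorts
 * the two (1)-tuples for the caller to return, and a subsequent call
 * transfers the (2)-group; only once the full sort state is drained do we
 * continue loading the prefix sort state directly from the outer node.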
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
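
/*
 * In short, with the default values above: we accumulate at least 32 tuples
 * before checking prefix key group membership at all, and if we read more
 * than 64 tuples without seeing the prefix keys change we assume we're in a
 * large group and switch to sorting only the suffix keys.
 */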
+
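/*
 * Overview of the execution_status values used by ExecIncrementalSort():
 *
 *	INCSORT_LOADFULLSORT   - accumulate tuples, sorting on all sort keys
 *	INCSORT_READFULLSORT   - return tuples from the full sort state
 *	INCSORT_LOADPREFIXSORT - accumulate a single large prefix key group,
 *							 sorting only on the suffix keys
 *	INCSORT_READPREFIXSORT - return tuples from the prefix sort state
 */
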
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, then we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * needs to move from the fullsort to presorted prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because the current sort state
+	 * holds only one of many sort batches rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because we guard setting up the
+	 * pivot comparator state similarly (on the full sort state being
+	 * unset), doing so might actually cause a leak.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are all equal.  Incremental sort is sensitive to the distribution
+	 * of tuples across groups, and here we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and inflate
+	 * the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead of its own.  First, it has to
+	 * detect the sort groups, which costs roughly one extra copy and
+	 * comparison per tuple.  Second, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
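+/*
+ * As a sketch, the pieces above combine as follows (this restates the
+ * arithmetic in cost_incremental_sort, with no new behavior):
+ *
+ *   startup_cost = input_startup_cost + group_input_run_cost
+ *                  + group_startup_cost
+ *   run_cost     = group_run_cost
+ *                  + (input_groups - 1) * (group_startup_cost
+ *                       + group_run_cost + group_input_run_cost)
+ *                  + (cpu_tuple_cost + comparison_cost) * input_tuples
+ *                  + 2.0 * cpu_tuple_cost * input_groups
+ *
+ * For example, with input_tuples = 10000 and input_groups = 100, startup
+ * pays for sorting one group (cost_tuplesort is called with 150 tuples,
+ * after the 1.5x pessimism factor) plus 1/100 of the input's run cost; the
+ * remaining 99 groups and the group-detection overhead are charged to
+ * run_cost.
+ */
+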
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..be569f56fd 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
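+
+/*
+ * For illustration (with hypothetical pathkeys A, B and C): given
+ * keys1 = (A, B) and keys2 = (A, B, C), this returns true and sets
+ * *n_common = 2, since keys1 is a prefix of keys2.  Given keys1 = (A, B, C)
+ * and keys2 = (A, B), it returns false but still sets *n_common = 2, which
+ * is the piece of information incremental sort needs.
+ */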
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1831,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving the
+ * performance of the requested ordering.  Thus we return the number of
+ * leading keys shared by the list and the requested ordering, or 0 if no
+ * useful keys are found.  For example, if the query requests ORDER BY
+ * a, b, c and the path is sorted by (a, b), we return 2.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create incremental sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
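+
+	/*
+	 * To summarize the decision above for each input path: a path already
+	 * sorted on all of sort_pathkeys is used as-is; the cheapest-total
+	 * unsorted path gets an explicit full Sort; and any path sorted on a
+	 * nonempty prefix of sort_pathkeys additionally gets an IncrementalSort,
+	 * provided enable_incrementalsort is on.
+	 */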
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..bc2c2dbb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..99d64a88af 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We pick a size that exceeds
+ * ALLOCSET_SEPARATE_THRESHOLD, so that the array is allocated as a separate
+ * block and reallocation overhead is reduced (see comments in
+ * grow_memtuples()).  However, we never use array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among all
+								 * sorted batches, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally a working memory context for tuples is setup in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * Having initialized all of the other non-parallel-related state, we can
+	 * now set up the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
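+/*
+ * As a sketch, the memory context hierarchy this arrangement produces
+ * (restating the setup above):
+ *
+ *   maincontext          survives tuplesort_reset; holds the Tuplesortstate,
+ *     |                  sort keys, and the memtuples array
+ *     +- sortcontext     reset by tuplesort_reset; holds most sort data
+ *          |
+ *          +- tuplecontext   recreated for each batch; holds caller tuples
+ */
+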
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1223,17 +1295,25 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1371,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk more important
+	 * for tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less than
+	 * the amount of space occupied by the same tuples in memory, due to a
+	 * more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
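+/*
+ * The precedence implemented above, as a sketch: a disk-based measurement
+ * always replaces an in-memory one, and within the same storage type a
+ * larger measurement replaces a smaller one.  Hence maxSpace reports disk
+ * usage whenever any batch has spilled, and peak memory usage otherwise.
+ */
+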
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but keep the
+ *	meta-information.  After tuplesort_reset, the tuplesort is ready to start
+ *	a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
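+
+/*
+ * A hypothetical usage sketch (illustrative only, not part of the patch;
+ * the tuplesort_begin_heap arguments, "slot", and the load/emit helpers
+ * are assumed).  tuplesort_reset keeps the sort metadata but frees
+ * per-batch data, so one state can serve many small sorts:
+ *
+ *     Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, nkeys, attNums,
+ *                                               sortOps, collations,
+ *                                               nullsFirst, work_mem,
+ *                                               NULL, false);
+ *
+ *     while (load_next_batch(ts))
+ *     {
+ *         tuplesort_performsort(ts);
+ *         while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
+ *             emit(slot);
+ *         tuplesort_reset(ts);
+ *     }
+ *     tuplesort_end(ts);
+ */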
 
 /*
@@ -2591,8 +2765,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2815,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3312,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..ebb8412237
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1400 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                  
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "quicksort",                    +
+                 "top-N heapsort"                +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                         explain_analyze_without_memory                          
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
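+
+-- A minimal usage sketch (the query below is hypothetical; any statement
+-- accepted by EXPLAIN works).  Memory figures such as "Memory: 25kB" in the
+-- EXPLAIN ANALYZE text are normalized to "Memory: NNkB", keeping expected
+-- output stable across runs:
+--   select explain_analyze_without_memory('select * from t order by a, b limit 10');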
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
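+
+-- The function above walks the JSON-format plan tree breadth-first and
+-- returns a jsonb array of every "Incremental Sort" node it encounters
+-- (with child "Plans" stripped), so the helpers below can inspect just the
+-- per-node group statistics.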
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
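+
+-- Like explain_analyze_without_memory, but operating on the extracted jsonb
+-- nodes: the per-group "Average/Maximum Sort Space Used" values are replaced
+-- with the placeholder "NN".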
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
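+
+-- The invariant verified above: within each group kind, the maximum sort
+-- space used must be at least the average sort space used.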
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v45-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v45-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 45b95f4631d808ed74811d32fb04c0401515cd8a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v45 3/7] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 ----
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++++
 src/backend/optimizer/plan/planner.c    | 130 ++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 5 files changed, 366 insertions(+), 32 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..32bf734820 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+	ListCell   *lc;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		bool		query_pathkeys_ok = true;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+			Expr	   *em_expr;
+
+			/*
+			 * We can't use incremental sort for pathkeys containing volatile
+			 * expressions. We could walk the exppression itself, but checking
+			 * ec_has_volatile here saves some cycles.
+			 */
+			if (pathkey_ec->ec_has_volatile ||
+				!(em_expr = find_em_expr_for_rel(pathkey_ec, rel)))
+			{
+				query_pathkeys_ok = false;
+				break;
+			}
+		}
+
+		if (query_pathkeys_ok)
+			useful_pathkeys_list = list_make1(list_copy(root->query_pathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
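+/*
+ * A worked example (a sketch, not tied to any particular test): given
+ * "SELECT * FROM tab ORDER BY a, b", query_pathkeys is (a, b).  The loop
+ * above accepts it as long as neither sort expression is volatile and both
+ * are computable from this relation, so a partial path sorted on (a) alone
+ * can later be finished with an incremental sort on (a, b).
+ */
+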
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * the input paths (aiming to preserve the ordering), but also considers
+ * orderings that might be useful to nodes above the gather merge node, and
+ * tries to add a sort (regular or incremental) to provide them.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* when incremental sort is disabled, we're done */
+	if (!enable_incrementalsort)
+		return;
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding an explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider a regular sort for the cheapest partial path (for each
+			 * set of useful pathkeys). We know the path is not sorted, because
+			 * we'd not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
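+/*
+ * The paths built above take shapes like this (a sketch):
+ *
+ *     Gather Merge
+ *       -> Incremental Sort (pathkeys: a, b; presorted: a)
+ *            -> Parallel Index Scan (pathkeys: a)
+ *
+ * alongside a full-sort variant built on the cheapest partial path.
+ */
+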
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..35e770f241 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6431,7 +6431,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6492,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, so no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6816,7 +6892,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +6929,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, so no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -7232,7 +7360,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v45-0006-fix-inc-sort-enabled-check.patchtext/x-patch; charset=US-ASCII; name=v45-0006-fix-inc-sort-enabled-check.patchDownload
From 27d1e6ebd84515465c2fe70f664048124fcf3681 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 20:04:03 -0400
Subject: [PATCH v45 6/7] fix inc sort enabled check

---
 src/backend/optimizer/path/allpaths.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 480803fb7a..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2818,10 +2818,6 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 	/* generate the regular gather (merge) paths */
 	generate_gather_paths(root, rel, override_rows);
 
-	/* when incremental sort is disabled, we're done */
-	if (!enable_incrementalsort)
-		return;
-
 	/* consider incremental sort for interesting orderings */
 	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
 
@@ -2911,7 +2907,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			 * Consider incremental sort, but only when the subpath is already
 			 * partially sorted on a pathkey prefix.
 			 */
-			if (presorted_keys > 0)
+			if (enable_incrementalsort && presorted_keys > 0)
 			{
 				Path	   *tmp;
 
-- 
2.17.1

v45-0007-explain-fixes.patchtext/x-patch; charset=US-ASCII; name=v45-0007-explain-fixes.patchDownload
From 28199b4ffe9a59dc3fab29b30af762eed0040c81 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 22:35:49 -0400
Subject: [PATCH v45 7/7] explain fixes

---
 src/backend/commands/explain.c             | 70 ++++++++++++----------
 src/backend/executor/nodeIncrementalSort.c | 63 ++++++++++---------
 src/include/nodes/execnodes.h              |  2 +-
 src/include/utils/tuplesort.h              | 11 ++--
 4 files changed, 76 insertions(+), 70 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 39d51848b6..24acde506e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2713,26 +2713,41 @@ show_sort_info(SortState *sortstate, ExplainState *es)
  */
 static void
 show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
-								 const char *groupLabel, ExplainState *es)
+								 const char *groupLabel, bool indent, ExplainState *es)
 {
 	ListCell   *methodCell;
-	int			methodCount = list_length(groupInfo->sortMethods);
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(Size) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & ((Size) 1 << bit))
+		{
+			TuplesortMethod sortMethod = (TuplesortMethod) ((Size) 1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		appendStringInfoSpaces(es->str, es->indent * 2);
-		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
 						 groupInfo->groupCount);
-		foreach(methodCell, groupInfo->sortMethods)
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
 		{
-			const char *sortMethodName;
-
-			sortMethodName = tuplesort_method_name(methodCell->int_value);
-			appendStringInfo(es->str, "%s", sortMethodName);
-			if (foreach_current_index(methodCell) < methodCount - 1)
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
 				appendStringInfo(es->str, ", ");
 		}
-		appendStringInfo(es->str, ")");
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
@@ -2740,7 +2755,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxMemorySpaceUsed);
 		}
@@ -2755,7 +2770,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			/* Add a semicolon separator only if memory stats were printed. */
 			if (groupInfo->maxMemorySpaceUsed > 0)
 				appendStringInfo(es->str, ";");
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxDiskSpaceUsed);
 		}
@@ -2764,7 +2779,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	}
 	else
 	{
-		List	   *methodNames = NIL;
 		StringInfoData groupName;
 
 		initStringInfo(&groupName);
@@ -2772,12 +2786,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
 		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
 
-		foreach(methodCell, groupInfo->sortMethods)
-		{
-			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
-
-			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
-		}
 		ExplainPropertyList("Sort Methods Used", methodNames, es);
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
@@ -2834,10 +2842,10 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
 		return;
 
-	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
 	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
 	if (prefixsortGroupInfo->groupCount > 0)
-		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
@@ -2860,20 +2868,18 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 				prefixsortGroupInfo->groupCount == 0)
 				continue;
 
-			if (!opened_group)
-			{
-				ExplainOpenGroup("Workers", "Workers", false, es);
-				opened_group = true;
-			}
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
 
 			if (fullsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+												 es->workers_state == NULL, es);
 			if (prefixsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
-		}
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
-		if (opened_group)
-			ExplainCloseGroup("Workers", "Workers", false, es);
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
 	}
 }
 
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 9fe93d5979..eb5515beb8 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -85,6 +85,27 @@
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
 
+/*
+ * We need to store the instrumentation information either in the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->shared_info && node->am_worker) \
+	{ \
+		Assert(IsParallelWorker()); \
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers); \
+		instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+	} else { \
+		instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+	}
+
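+/*
+ * For example (mirroring the call sites below), after sorting a full-sort
+ * group one writes
+ *
+ *     INSTRUMENT_SORT_GROUP(node, fullsort)
+ *
+ * which records the statistics of node->fullsort_state into either the
+ * local or the shared fullsortGroupInfo, as appropriate.
+ */
+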
 /* ----------------------------------------------------------------
  * instrumentSortedGroup
  *
@@ -94,12 +115,10 @@
  * ----------------------------------------------------------------
  */
 static void
-instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
 					  Tuplesortstate *sortState)
 {
-	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
 	TuplesortInstrumentation sort_instr;
-
 	groupInfo->groupCount++;
 
 	tuplesort_get_stats(sortState, &sort_instr);
@@ -122,19 +141,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	}
 
 	/* Track each sort method we've used. */
-	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
-		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
-											 sort_instr.sortMethod);
-
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
-	{
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
-
-		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
-			   &node->incsort_info, sizeof(IncrementalSortInfo));
-	}
+	groupInfo->sortMethods |= sort_instr.sortMethod;
 }
 
 /* ----------------------------------------------------------------
@@ -435,9 +442,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		tuplesort_performsort(node->prefixsort_state);
 
 		if (pstate->instrument != NULL)
-			instrumentSortedGroup(pstate,
-								  &node->incsort_info.prefixsortGroupInfo,
-								  node->prefixsort_state);
+			INSTRUMENT_SORT_GROUP(node, prefixsort)
 
 		if (node->bounded)
 		{
@@ -696,9 +701,7 @@ ExecIncrementalSort(PlanState *pstate)
 				tuplesort_performsort(fullsort_state);
 
 				if (pstate->instrument != NULL)
-					instrumentSortedGroup(pstate,
-										  &node->incsort_info.fullsortGroupInfo,
-										  fullsort_state);
+					INSTRUMENT_SORT_GROUP(node, fullsort)
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -780,9 +783,7 @@ ExecIncrementalSort(PlanState *pstate)
 					tuplesort_performsort(fullsort_state);
 
 					if (pstate->instrument != NULL)
-						instrumentSortedGroup(pstate,
-											  &node->incsort_info.fullsortGroupInfo,
-											  fullsort_state);
+						INSTRUMENT_SORT_GROUP(node, fullsort)
 
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
@@ -822,9 +823,7 @@ ExecIncrementalSort(PlanState *pstate)
 				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
 				if (pstate->instrument != NULL)
-					instrumentSortedGroup(pstate,
-										  &node->incsort_info.fullsortGroupInfo,
-										  fullsort_state);
+					INSTRUMENT_SORT_GROUP(node, fullsort)
 
 				/*
 				 * If the full sort tuplesort happened to switch into top-n
@@ -938,9 +937,7 @@ ExecIncrementalSort(PlanState *pstate)
 		tuplesort_performsort(node->prefixsort_state);
 
 		if (pstate->instrument != NULL)
-			instrumentSortedGroup(pstate,
-								  &node->incsort_info.prefixsortGroupInfo,
-								  node->prefixsort_state);
+			INSTRUMENT_SORT_GROUP(node, prefixsort)
 
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
@@ -1026,13 +1023,13 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 		fullsortGroupInfo->totalDiskSpaceUsed = 0;
 		fullsortGroupInfo->maxMemorySpaceUsed = 0;
 		fullsortGroupInfo->totalMemorySpaceUsed = 0;
-		fullsortGroupInfo->sortMethods = NIL;
+		fullsortGroupInfo->sortMethods = 0;
 		prefixsortGroupInfo->groupCount = 0;
 		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
 		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
 		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
 		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
-		prefixsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->sortMethods = 0;
 	}
 
 	/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6127ab5912..8d1b944472 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2034,7 +2034,7 @@ typedef struct IncrementalSortGroupInfo
 	long		totalDiskSpaceUsed;
 	long		maxMemorySpaceUsed;
 	long		totalMemorySpaceUsed;
-	List	   *sortMethods;
+	Size		sortMethods; /* bitmask of TuplesortMethod */
 } IncrementalSortGroupInfo;
 
 typedef struct IncrementalSortInfo
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0e9ab4e586..96e970339c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
 	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
 } TuplesortMethod;
 
 typedef enum
-- 
2.17.1

#251Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#250)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in the planner.

A couple more comments:

1) I think the GUC documentation in doc/src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place to move the details to. What about adding some examples
and explanation to perform.sgml?
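
For example, perform.sgml could carry a short sketch along these lines
(hypothetical schema, shaped like the table in the plan below):

  CREATE TABLE t (a int, b int, c int);
  CREATE INDEX t_a_b_idx ON t (a, b);

  -- The index returns rows already sorted by (a, b), so only c needs
  -- sorting within each (a, b) group, and the planner may pick
  -- Incremental Sort instead of a full Sort.
  EXPLAIN SELECT * FROM t ORDER BY a, b, c LIMIT 10;

  -- Disabling the GUC forces sorting the whole result set at once.
  SET enable_incrementalsort = off;
  EXPLAIN SELECT * FROM t ORDER BY a, b, c LIMIT 10;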

2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
   Output: a, b, c
   Workers Planned: 2
   Workers Launched: 2
   ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
         Output: a, b, c
         Sort Key: t.a, t.b, t.c
         Presorted Key: t.a, t.b
         Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
         Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
         Worker 0: actual time=0.766..16122.368 rows=3841573 loops=1
         Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
           Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
         Worker 1: actual time=1.986..16189.831 rows=3845490 loops=1
         Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
           Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
         ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
               Output: a, b, c
               Worker 0: actual time=0.048..11275.178 rows=3841573 loops=1
               Worker 1: actual time=0.041..11314.133 rows=3845490 loops=1
 Planning Time: 0.166 ms
 Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...

3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields with "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

My TODO for this patch is this:

- review the costing (I think the estimates are OK, but I recall I
haven't been entirely happy with how it's broken into functions.)

- review the tuplesort changes (the memory contexts etc.)

- do more testing of performance impact on planning

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v46-0001-Consider-low-startup-cost-when-adding-partial-path.patchtext/plain; charset=us-asciiDownload
From a497288426699e5529776fe60cc72ed3d0af98d1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, even though a limit applied on top of
that would normally favor the lower startup cost plan.
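
For example, with a LIMIT on top of a parallel plan, a query of roughly
this shape can hit the case (a sketch; assumes a table t with an index
on (a, b) but no index covering (a, b, c)):

  -- The partial path ordered by (a, b) has low startup cost, because
  -- it can feed an incremental sort that starts returning rows early;
  -- a partial path with lower total cost but high startup cost loses
  -- once the limit is applied, so add_partial_path must not discard
  -- the former on total cost alone.
  EXPLAIN SELECT * FROM t ORDER BY a, b, c LIMIT 10;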
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may, however, matter how well the
+ *	  path ordering matches the final ordering needed by upper parts of the
+ *	  plan. Because that affects how expensive the incremental sort is,
+ *	  we need to consider both total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.21.1

v46-0002-Implement-incremental-sort.patchtext/plain; charset=us-asciiDownload
From 612308a5a663d7398cd4666ec6ddbf2cee0376d1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
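
The two modes are visible in EXPLAIN ANALYZE output as "Full-sort
Groups" and "Presorted Groups" respectively; a quick way to see both
(a sketch; a table t with an index on (a, b) is assumed):

  -- Small (a, b) groups are batched and fully sorted; a long run of
  -- rows with one dominant (a, b) value switches the node into
  -- presorted prefix mode.
  EXPLAIN (ANALYZE) SELECT * FROM t ORDER BY a, b, c;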

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   22 +
 src/backend/commands/explain.c                |  223 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1267 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1400 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 40 files changed, 4141 insertions(+), 160 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..2f2e19dc64 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,28 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m < n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ff2f45cfb2..85d7bcb78f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,180 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, ExplainState *es)
+{
+	ListCell   *methodCell;
+	int			methodCount = list_length(groupInfo->sortMethods);
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+						 groupInfo->groupCount);
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName;
+
+			sortMethodName = tuplesort_method_name(methodCell->int_value);
+			appendStringInfo(es->str, "%s", sortMethodName);
+			if (foreach_current_index(methodCell) < methodCount - 1)
+				appendStringInfo(es->str, ", ");
+		}
+		appendStringInfo(es->str, ")");
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		List	   *methodNames = NIL;
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		foreach(methodCell, groupInfo->sortMethods)
+		{
+			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
+
+			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
+		}
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		opened_group = false;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (!opened_group)
+			{
+				ExplainOpenGroup("Workers", "Workers", false, es);
+				opened_group = true;
+			}
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..9fe93d5979
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
+		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
+											 sort_instr.sortMethod);
+
+	/* Record shared stats if we're a parallel worker. */
+	if (node->shared_info && node->am_worker)
+	{
+		Assert(IsParallelWorker());
+		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
+			   &node->incsort_info, sizeof(IncrementalSortInfo));
+	}
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining = 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
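+		/*
+		 * For example, if bound is 100 and bound_Done is already 90, then
+		 * currentBound is 10: assuming that's below DEFAULT_MIN_GROUP_SIZE,
+		 * we bound the tuplesort to 10 tuples and likewise cap minGroupSize
+		 * at 10, since accumulating more than the remaining bound can never
+		 * help.
+		 */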
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
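+		/*
+		 * For example, with minGroupSize = 5 the carried-over tuple is only
+		 * the first tuple of the new batch, so it can't yet serve as the
+		 * pivot and the slot is cleared; with minGroupSize = 1 it already is
+		 * the pivot, so the slot is kept as-is.
+		 */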
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group, we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					if (pstate->instrument != NULL)
+						instrumentSortedGroup(pstate,
+											  &node->incsort_info.fullsortGroupInfo,
+											  fullsort_state);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+				if (pstate->instrument != NULL)
+					instrumentSortedGroup(pstate,
+										  &node->incsort_info.fullsortGroupInfo,
+										  fullsort_state);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
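+				/*
+				 * For example, if currentBound were 10 and we had accumulated
+				 * many more tuples than that before hitting this transition,
+				 * a bounded (top-n) sort would have retained only the 10 best
+				 * tuples, so nTuples is clamped to 10 before being recorded
+				 * as n_fullsort_remaining below.
+				 */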
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * must move tuples from the full sort to the presorted prefix
+				 * sort state.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to  sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		if (pstate->instrument != NULL)
+			instrumentSortedGroup(pstate,
+								  &node->incsort_info.prefixsortGroupInfo,
+								  node->prefixsort_state);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for later
+			 * use in configuring the sort state's bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
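+
+/*
+ * In summary, the execution status cycles roughly as follows: each cycle
+ * starts in INCSORT_LOADFULLSORT, accumulating a batch sorted by all keys.
+ * A batch normally drains via INCSORT_READFULLSORT; a batch whose prefix
+ * key group exceeds DEFAULT_MAX_FULL_SORT_GROUP_SIZE instead transitions
+ * (via switchToPresortedPrefixMode()) through INCSORT_LOADPREFIXSORT and
+ * drains via INCSORT_READPREFIXSORT.  Once a batch is drained, we loop back
+ * to INCSORT_LOADFULLSORT until the outer node is exhausted.
+ */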
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because the current sort state holds only one of
+	 * many sort batches rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = NIL;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory, if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, reset them now.
+	 * Resetting releases the bulk of their memory immediately; we then null
+	 * the pointers so that the next pass through ExecIncrementalSort
+	 * reinitializes the sort states together with the pivot comparator state
+	 * (presorted_keys), which was likewise cleared above.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
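+
+/*
+ * As a worked example for the in-memory case: with t = 1,000,000 tuples
+ * fitting in sort_mem and no useful bound, the startup cost comes to
+ * comparison_cost * t * log2(t), i.e. roughly 2.0e7 * comparison_cost,
+ * and the run cost adds one cpu_operator_cost per tuple retrieved.
+ */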
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted keys
+	 * are equal.  Incremental sort is sensitive to the distribution of tuples
+	 * across groups, and we only have quite rough estimates of it, so we're
+	 * pessimistic about incremental sort performance and inflate the average
+	 * group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * The startup cost of incremental sort is the cost of sorting its first
+	 * group, plus the input's startup cost, plus the share of the input run
+	 * cost needed to produce that first group.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we've started producing tuples from the first group, the cost of
+	 * producing all remaining tuples is the cost to finish this group, plus
+	 * the total cost to process the remaining groups, plus the remaining
+	 * cost of the input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
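+
+/*
+ * For example (with illustrative numbers): 10,000 input tuples estimated to
+ * fall into 100 groups by the presorted keys gives group_tuples = 100, so we
+ * cost a tuplesort of 1.5 * 100 = 150 tuples per group.  The first group's
+ * sort, the input startup cost, and 1/100th of the input run cost form the
+ * startup cost; the other 99 groups, the remaining input run cost, and the
+ * per-tuple group-detection plus per-group reset overheads form the run
+ * cost.
+ */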
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..3b84feaf7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,49 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
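+ *
+ *    For example, with keys1 = (a, b, c) and keys2 = (a, b), this sets
+ *    *n_common to 2 and returns false; with keys1 = (a, b) and
+ *    keys2 = (a, b, c), it sets *n_common to 2 and returns true.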
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1829,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving performance.
+ * Thus we return 0 if no useful keys are found, or else the number of
+ * leading keys shared by the list and the requested ordering.
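+ *
+ * For example, if the query requests ORDER BY a, b, c and the given pathkeys
+ * represent an ordering on (a, b), we return 2.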
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Does the same as create_sort_plan, but creates an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b65abf6046..753e23676b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The new paths we need consider are an explicit full sort on the
+ * cheapest-total existing path, plus incremental sorts on any paths
+ * already sorted by a useful prefix of the required ordering.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD, which may lower the
+ * allocation overhead.  However, we don't consider array sizes less than
+ * 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied by any
+								 * one sort batch, either in-memory or
+								 * on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is a measure of on-disk
+								 * space, false when it measures in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data that is useful to keep across multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * With all of the other non-parallel-related state initialized, set up
+	 * the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
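+
+	/*
+	 * If a previous batch grew the memtuples array beyond its initial size,
+	 * free it so that each batch starts from a modest allocation; otherwise
+	 * the existing array is reused.
+	 */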
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing the resources of a tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, due to the
+	 * more compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in place.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating tuplesort states (and thus
+ *	saves resources) when sorting multiple small batches.
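+ *
+ *	A typical usage pattern, as a sketch:
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch:
+ *			tuplesort_puttupleslot(state, slot);	(per input tuple)
+ *			tuplesort_performsort(state);
+ *			while (tuplesort_gettupleslot(state, ...)) ...
+ *			tuplesort_reset(state);
+ *		tuplesort_end(state);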
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..6127ab5912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset may already be sorted
+ *	 on a prefix of those keys.  We call these "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
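+	/* cumulative statistics across all sort groups of this kind */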
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	List	   *sortMethods;
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
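+	/*
+	 * LOADFULLSORT/LOADPREFIXSORT: loading tuples into the full-sort or
+	 * prefix-only sort state; READFULLSORT/READPREFIXSORT: returning sorted
+	 * tuples from the corresponding sort state.
+	 */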
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
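+	/* number of tuples loaded into fullsort_state but not yet processed */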
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
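+	/* slot used to transfer tuples between the two sort states */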
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..0e9ab4e586 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..ebb8412237
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1400 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
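+-- Utility functions for inspecting the EXPLAIN ANALYZE output of incremental
+-- sort nodes while masking memory usage numbers that vary between runs.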
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                  
+-------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "quicksort",                    +
+                 "top-N heapsort"                +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                         explain_analyze_without_memory                          
+---------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.1

v46-0003-explain-fixes.patch (text/plain; charset=us-ascii)
From df9c30e1e8a7761a47ea62ef9e768a9ce7ac4b87 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sat, 28 Mar 2020 22:35:49 -0400
Subject: [PATCH 3/4] explain fixes

---
 src/backend/commands/explain.c                | 71 +++++++++---------
 src/backend/executor/nodeIncrementalSort.c    | 72 +++++++++----------
 src/include/nodes/execnodes.h                 |  2 +-
 src/include/utils/tuplesort.h                 | 11 +--
 .../regress/expected/incremental_sort.out     | 18 ++---
 5 files changed, 89 insertions(+), 85 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 85d7bcb78f..8a0dcf09b0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2713,26 +2713,41 @@ show_sort_info(SortState *sortstate, ExplainState *es)
  */
 static void
 show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
-								 const char *groupLabel, ExplainState *es)
+								 const char *groupLabel, bool indent, ExplainState *es)
 {
 	ListCell   *methodCell;
-	int			methodCount = list_length(groupInfo->sortMethods);
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(Size) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (((Size) 1) << bit))
+		{
+			TuplesortMethod sortMethod = (TuplesortMethod) (((Size) 1) << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
-		appendStringInfoSpaces(es->str, es->indent * 2);
-		appendStringInfo(es->str, "%s Groups: %ld (Methods: ", groupLabel,
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
 						 groupInfo->groupCount);
-		foreach(methodCell, groupInfo->sortMethods)
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
 		{
-			const char *sortMethodName;
-
-			sortMethodName = tuplesort_method_name(methodCell->int_value);
-			appendStringInfo(es->str, "%s", sortMethodName);
-			if (foreach_current_index(methodCell) < methodCount - 1)
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
 				appendStringInfo(es->str, ", ");
 		}
-		appendStringInfo(es->str, ")");
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
@@ -2740,7 +2755,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxMemorySpaceUsed);
 		}
@@ -2755,7 +2770,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			/* Add a semicolon separator only if memory stats were printed. */
 			if (groupInfo->maxMemorySpaceUsed > 0)
 				appendStringInfo(es->str, ";");
-			appendStringInfo(es->str, " %s: %ldkB (avg), %ldkB (max)",
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxDiskSpaceUsed);
 		}
@@ -2764,7 +2779,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	}
 	else
 	{
-		List	   *methodNames = NIL;
 		StringInfoData groupName;
 
 		initStringInfo(&groupName);
@@ -2772,12 +2786,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
 		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
 
-		foreach(methodCell, groupInfo->sortMethods)
-		{
-			const char *sortMethodName = tuplesort_method_name(methodCell->int_value);
-
-			methodNames = lappend(methodNames, unconstify(char *, sortMethodName));
-		}
 		ExplainPropertyList("Sort Methods Used", methodNames, es);
 
 		if (groupInfo->maxMemorySpaceUsed > 0)
@@ -2834,15 +2842,14 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
 		return;
 
-	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
 	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
 	if (prefixsortGroupInfo->groupCount > 0)
-		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
 	if (incrsortstate->shared_info != NULL)
 	{
 		int			n;
-		bool		opened_group = false;
 
 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
 		{
@@ -2860,20 +2867,18 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 				prefixsortGroupInfo->groupCount == 0)
 				continue;
 
-			if (!opened_group)
-			{
-				ExplainOpenGroup("Workers", "Workers", false, es);
-				opened_group = true;
-			}
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
 
 			if (fullsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", es);
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+												 es->workers_state == NULL, es);
 			if (prefixsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", es);
-		}
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
 
-		if (opened_group)
-			ExplainCloseGroup("Workers", "Workers", false, es);
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
 	}
 }
 
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 9fe93d5979..6c683538ff 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -85,6 +85,30 @@
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
 
+/*
+ * We need to store the instrumentation information either in the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->ss.ps.instrument != NULL) \
+	{ \
+		if (node->shared_info && node->am_worker) \
+		{ \
+			Assert(IsParallelWorker()); \
+			Assert(ParallelWorkerNumber <= node->shared_info->num_workers); \
+			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+		} else { \
+			instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+		} \
+	}
+
 /* ----------------------------------------------------------------
  * instrumentSortedGroup
  *
@@ -94,12 +118,10 @@
  * ----------------------------------------------------------------
  */
 static void
-instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
 					  Tuplesortstate *sortState)
 {
-	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
 	TuplesortInstrumentation sort_instr;
-
 	groupInfo->groupCount++;
 
 	tuplesort_get_stats(sortState, &sort_instr);
@@ -122,19 +144,7 @@ instrumentSortedGroup(PlanState *pstate, IncrementalSortGroupInfo *groupInfo,
 	}
 
 	/* Track each sort method we've used. */
-	if (!list_member_int(groupInfo->sortMethods, sort_instr.sortMethod))
-		groupInfo->sortMethods = lappend_int(groupInfo->sortMethods,
-											 sort_instr.sortMethod);
-
-	/* Record shared stats if we're a parallel worker. */
-	if (node->shared_info && node->am_worker)
-	{
-		Assert(IsParallelWorker());
-		Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
-
-		memcpy(&node->shared_info->sinfo[ParallelWorkerNumber],
-			   &node->incsort_info, sizeof(IncrementalSortInfo));
-	}
+	groupInfo->sortMethods |= sort_instr.sortMethod;
 }
 
 /* ----------------------------------------------------------------
@@ -434,10 +444,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
 
-		if (pstate->instrument != NULL)
-			instrumentSortedGroup(pstate,
-								  &node->incsort_info.prefixsortGroupInfo,
-								  node->prefixsort_state);
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
 
 		if (node->bounded)
 		{
@@ -695,10 +702,7 @@ ExecIncrementalSort(PlanState *pstate)
 				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
 
-				if (pstate->instrument != NULL)
-					instrumentSortedGroup(pstate,
-										  &node->incsort_info.fullsortGroupInfo,
-										  fullsort_state);
+				INSTRUMENT_SORT_GROUP(node, fullsort)
 
 				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
 				node->execution_status = INCSORT_READFULLSORT;
@@ -779,10 +783,7 @@ ExecIncrementalSort(PlanState *pstate)
 							   nTuples);
 					tuplesort_performsort(fullsort_state);
 
-					if (pstate->instrument != NULL)
-						instrumentSortedGroup(pstate,
-											  &node->incsort_info.fullsortGroupInfo,
-											  fullsort_state);
+					INSTRUMENT_SORT_GROUP(node, fullsort)
 
 					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
 					node->execution_status = INCSORT_READFULLSORT;
@@ -821,10 +822,8 @@ ExecIncrementalSort(PlanState *pstate)
 				 */
 				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
 				tuplesort_performsort(fullsort_state);
-				if (pstate->instrument != NULL)
-					instrumentSortedGroup(pstate,
-										  &node->incsort_info.fullsortGroupInfo,
-										  fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
 
 				/*
 				 * If the full sort tuplesort happened to switch into top-n
@@ -937,10 +936,7 @@ ExecIncrementalSort(PlanState *pstate)
 		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
 		tuplesort_performsort(node->prefixsort_state);
 
-		if (pstate->instrument != NULL)
-			instrumentSortedGroup(pstate,
-								  &node->incsort_info.prefixsortGroupInfo,
-								  node->prefixsort_state);
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
 
 		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
 		node->execution_status = INCSORT_READPREFIXSORT;
@@ -1026,13 +1022,13 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 		fullsortGroupInfo->totalDiskSpaceUsed = 0;
 		fullsortGroupInfo->maxMemorySpaceUsed = 0;
 		fullsortGroupInfo->totalMemorySpaceUsed = 0;
-		fullsortGroupInfo->sortMethods = NIL;
+		fullsortGroupInfo->sortMethods = 0;
 		prefixsortGroupInfo->groupCount = 0;
 		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
 		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
 		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
 		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
-		prefixsortGroupInfo->sortMethods = NIL;
+		prefixsortGroupInfo->sortMethods = 0;
 	}
 
 	/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6127ab5912..8d1b944472 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2034,7 +2034,7 @@ typedef struct IncrementalSortGroupInfo
 	long		totalDiskSpaceUsed;
 	long		maxMemorySpaceUsed;
 	long		totalMemorySpaceUsed;
-	List	   *sortMethods;
+	Size		sortMethods; /* bitmask of TuplesortMethod */
 } IncrementalSortGroupInfo;
 
 typedef struct IncrementalSortInfo
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0e9ab4e586..96e970339c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a distinct bit.
  */
 typedef enum
 {
 	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
 } TuplesortMethod;
 
 typedef enum
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index ebb8412237..9a9cb9f28c 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -531,13 +531,13 @@ select * from (select * from t order by a) s order by a, b limit 55;
 
 -- Test EXPLAIN ANALYZE with only a fullsort group.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
-                                 explain_analyze_without_memory                                  
--------------------------------------------------------------------------------------------------
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
  Limit (actual rows=55 loops=1)
    ->  Incremental Sort (actual rows=55 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 2 (Methods: quicksort, top-N heapsort) Memory: NNkB (avg), NNkB (max)
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
          ->  Sort (actual rows=100 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
@@ -563,8 +563,8 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
          "Full-sort Groups": {                   +
              "Group Count": 2,                   +
              "Sort Methods Used": [              +
-                 "quicksort",                    +
-                 "top-N heapsort"                +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
              ],                                  +
              "Sort Space Memory": {              +
                  "Average Sort Space Used": "NN",+
@@ -705,14 +705,14 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                         explain_analyze_without_memory                          
----------------------------------------------------------------------------------
+                        explain_analyze_without_memory                         
+-------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
-         Presorted Groups: 5 (Methods: quicksort) Memory: NNkB (avg), NNkB (max)
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
          ->  Sort (actual rows=100 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
-- 
2.21.1

v46-0004-Consider-incremental-sort-paths-in-additional-places.patch (text/plain; charset=us-ascii)
From 4959c782bbfea165876754021ad9b9287898edd6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH 4/4] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 346 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 580 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * the criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			pathkeys = lappend(pathkeys, pathkey);
+		}
+
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding an explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider a regular sort for the cheapest partial path (for each
+			 * useful pathkeys list). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 753e23676b..fb094e3be0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This probably duplicates the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6431,7 +6492,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6553,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6501,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6537,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6808,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6816,7 +7061,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +7098,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6948,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6977,6 +7275,48 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7078,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7232,7 +7570,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.21.1

#252James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#251)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in the planner.

A couple more comments:

1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place to move the details to. What about adding some examples
and explanation to perform.sgml?

I'll take a look at that and include in a patch series tomorrow.
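
Roughly the kind of example I have in mind for perform.sgml (just a
sketch at this point; the table, columns, and data are illustrative):

    create table measurements (station int, ts timestamptz, reading float8);
    create index on measurements (station);

    -- With an index on (station) alone, ORDER BY (station, ts) can sort
    -- each station's rows separately instead of sorting the whole table,
    -- which pays off especially when combined with a LIMIT:
    explain (costs off)
    select * from measurements order by station, ts limit 10;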

2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
Output: a, b, c
Workers Planned: 2
Workers Launched: 2
-> Incremental Sort (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
Output: a, b, c
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
Worker 0: actual time=0.766..16122.368 rows=3841573 loops=1
Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
Worker 1: actual time=1.986..16189.831 rows=3845490 loops=1
Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
-> Parallel Index Scan using t_a_b_idx on public.t (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
Output: a, b, c
Worker 0: actual time=0.048..11275.178 rows=3841573 loops=1
Worker 1: actual time=0.041..11314.133 rows=3845490 loops=1
Planning Time: 0.166 ms
Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

Working on that too.
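
For anyone wanting to reproduce a plan of that shape, something along
these lines should do it (a sketch; the sizes and data distribution are
illustrative, matching the t_a_b_idx index in the plan above):

    create table t (a bigint, b bigint, c bigint);
    insert into t
      select i / 100000, i / 100, i
        from generate_series(1, 10000000) i;
    create index t_a_b_idx on t (a, b);
    analyze t;

    set max_parallel_workers_per_gather = 2;
    explain (verbose, analyze) select a from t order by a, b, c;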

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...

All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?

3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

I can fix this too.

Also, I noticed a few compiler warnings I'll fix up in tomorrow's reply.

My TODO for this patch is this:

- review the costing (I think the estimates are OK, but I recall I
haven't been entirely happy with how it's broken into functions.)

- review the tuplesort changes (the memory contexts etc.)

- do more testing of performance impact on planning

Sounds good.

Thanks,
James

#253Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#252)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:

On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in the planner.

A couple more comments:

1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place to move the details to. What about adding some examples
and explanation to perform.sgml?

I'll take a look at that and include in a patch series tomorrow.

2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
Output: a, b, c
Workers Planned: 2
Workers Launched: 2
-> Incremental Sort (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
Output: a, b, c
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
Worker 0: actual time=0.766..16122.368 rows=3841573 loops=1
Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
Worker 1: actual time=1.986..16189.831 rows=3845490 loops=1
Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
-> Parallel Index Scan using t_a_b_idx on public.t (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
Output: a, b, c
Worker 0: actual time=0.048..11275.178 rows=3841573 loops=1
Worker 1: actual time=0.041..11314.133 rows=3845490 loops=1
Planning Time: 0.166 ms
Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

Working on that too.

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...

All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?

I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition into it. That makes the call sites simpler.

3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

I can fix this too.

Also I noticed a few compiler warnings I'll fixup in tomorrow's reply.

OK

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#254James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#253)
6 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 30, 2020 at 8:24 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:

On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in the planner.

A couple more comments:

1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place to move the details. What about adding some examples
and explanation to perform.sgml?

I'll take a look at that and include it in a patch series tomorrow.

Attached.

2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
Output: a, b, c
Workers Planned: 2
Workers Launched: 2
-> Incremental Sort (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
Output: a, b, c
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
Worker 0: actual time=0.766..16122.368 rows=3841573 loops=1
Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
Worker 1: actual time=1.986..16189.831 rows=3845490 loops=1
Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
-> Parallel Index Scan using t_a_b_idx on public.t (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
Output: a, b, c
Worker 0: actual time=0.048..11275.178 rows=3841573 loops=1
Worker 1: actual time=0.041..11314.133 rows=3845490 loops=1
Planning Time: 0.166 ms
Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

Working on that too.

See attached. I've folded the original "explain fixes" patch into
the main series, and the "explain fixes" patch in this series contains
only the changes for the above.

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...

All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?

I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition into it. That makes the calls simpler.

Ah, that actually fixed some of the compiler warnings. The other is
fixed in my explain fixes patch.

3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields with "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

I can fix this too.

Changed everywhere we used this var name. I'm tempted to change to
nPresortedKeys, but a cursory glance suggests some cases might
actually be consistent with other var names referencing columns, so I'm
not sure if we want to go down that path (and change more than just
this).

Also I noticed a few compiler warnings I'll fixup in tomorrow's reply.

OK

Mentioned above.

Thanks,
James

Attachments:

v47-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v47-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 076519a9a84db0087d41348a909ec167e1a39f64 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v47 1/6] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen, because a low startup cost partial path is ignored in favor of
a lower total cost partial path, even though a limit applied on top of
that would normally favor the lower startup cost plan.
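
As a hypothetical illustration (the table, index, and query below are
invented for this example, not taken from the regression tests):

    -- An index on (a) yields a partial path ordered by a with low
    -- startup cost, and an incremental sort on (a, b) preserves that
    -- property, while a seq scan plus full sort may have lower total
    -- cost but must consume its whole input before returning a tuple.
    CREATE TABLE t (a int, b int);
    CREATE INDEX t_a_idx ON t (a);

    -- With a small LIMIT the low startup cost plan should win, yet
    -- comparing partial paths on total cost alone would discard it
    -- before the limit is ever taken into account.
    EXPLAIN SELECT * FROM t ORDER BY a, b LIMIT 10;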
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v47-0004-docs-updates.patchtext/x-patch; charset=US-ASCII; name=v47-0004-docs-updates.patchDownload
From e41124b17b54d297fa1788a7053cbfa6e9a39e40 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 30 Mar 2020 18:44:13 -0400
Subject: [PATCH v47 4/6] docs updates

---
 doc/src/sgml/config.sgml  | 12 ++---------
 doc/src/sgml/perform.sgml | 42 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47ceea43d9..043c765264 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4550,16 +4550,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </term>
       <listitem>
        <para>
-        Enables or disables the query planner's use of incremental sort, which
-        allows the planner to take advantage of data presorted on columns
-        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
-        (where <literal>m < n</literal>) is required. Compared to regular sorts,
-        incremental sort allows returning tuples before the entire result set
-        has been sorted, particularly enabling optimizations with
-        <literal>LIMIT</literal> queries. It may also reduce memory usage and
-        the likelihood of spilling sorts to disk, but comes at the cost of
-        increased overhead splitting the result set into multiple sorting
-        batches. The default is <literal>on</literal>.
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
-- 
2.17.1

v47-0003-explain-fixes.patchtext/x-patch; charset=US-ASCII; name=v47-0003-explain-fixes.patchDownload
From 78f708bfad7eeed251438ea2d0a59ce674693f89 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 30 Mar 2020 18:39:09 -0400
Subject: [PATCH v47 3/6] explain fixes

---
 src/backend/commands/explain.c                | 25 +++++++++++++------
 .../regress/expected/incremental_sort.out     |  9 +++----
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 799c6c44d2..cd2d81712f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2774,8 +2774,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 							 spaceTypeName, avgSpace,
 							 groupInfo->maxDiskSpaceUsed);
 		}
-
-		appendStringInfo(es->str, "\n");
 	}
 	else
 	{
@@ -2845,11 +2843,18 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
 	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
 	if (prefixsortGroupInfo->groupCount > 0)
-		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
 
 	if (incrsortstate->shared_info != NULL)
 	{
 		int			n;
+		bool		indent_first_line;
 
 		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
 		{
@@ -2870,11 +2875,17 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			if (es->workers_state)
 				ExplainOpenWorker(n, es);
 
-			if (fullsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
-												 es->workers_state == NULL, es);
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
 			if (prefixsortGroupInfo->groupCount > 0)
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
 
 			if (es->workers_state)
 				ExplainCloseWorker(n, es);
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 9a9cb9f28c..288a5b2101 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -705,19 +705,18 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                        explain_analyze_without_memory                         
--------------------------------------------------------------------------------
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
-         Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
          ->  Sort (actual rows=100 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
                ->  Seq Scan on t (actual rows=100 loops=1)
-(10 rows)
+(9 rows)
 
 select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
                    jsonb_pretty                   
-- 
2.17.1

v47-0005-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v47-0005-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From af9967a112099a35444dac100dcdba1ae72aa433 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v47 5/6] Consider incremental sort paths in additional places

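Unlike plain generate_gather_paths, generate_useful_gather_paths also
considers orderings that are useful for nodes above the Gather Merge,
adding a regular or incremental sort below it when needed. As a
hypothetical sketch (table and index names invented), with an index on
(a, b) and a query ordered by (a, b, c) this enables plans like:

    EXPLAIN SELECT a, b, c FROM t ORDER BY a, b, c;
    -- Gather Merge
    --   ->  Incremental Sort  (Sort Key: a, b, c; Presorted Key: a, b)
    --         ->  Parallel Index Scan using t_a_b_idx
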
---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 346 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 580 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			pathkeys = lappend(pathkeys, pathkey);
+		}
+
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 423ac25827..881302d0a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5077,6 +5077,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6431,7 +6492,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6490,6 +6553,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6501,12 +6638,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6537,6 +6680,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6808,6 +7001,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6816,7 +7061,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6851,6 +7098,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6948,10 +7245,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6977,6 +7275,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7078,7 +7416,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7232,7 +7570,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v47-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v47-0002-Implement-incremental-sort.patchDownload
From 0157bc9f4ae71f2a503de4ae71324cfc549c11b8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v47 2/6] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
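
For illustration (hypothetical table and index, not taken from the
code or tests): with an index on (a) and a query ordered by (a, b),
tuples arrive grouped by equal values of a, so only b needs sorting
within each group.

    CREATE TABLE t (a int, b int);
    CREATE INDEX t_a_idx ON t (a);
    -- Runs of small adjacent prefix groups get batched and sorted on
    -- (a, b) together (the first mode); a single large group is
    -- sorted on b alone (the second mode).
    EXPLAIN SELECT * FROM t ORDER BY a, b;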

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   22 +
 src/backend/commands/explain.c                |  228 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |   14 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1400 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 40 files changed, 4149 insertions(+), 164 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..47ceea43d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4542,6 +4542,28 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort, which
+        allows the planner to take advantage of data presorted on columns
+        <literal>1..m</literal> when an ordering on columns <literal>1..n</literal>
+        (where <literal>m < n</literal>) is required. Compared to regular sorts,
+        incremental sort allows returning tuples before the entire result set
+        has been sorted, particularly enabling optimizations with
+        <literal>LIMIT</literal> queries. It may also reduce memory usage and
+        the likelihood of spilling sorts to disk, but comes at the cost of
+        increased overhead splitting the result set into multiple sorting
+        batches. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..799c6c44d2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,185 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(Size); ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+
+		appendStringInfo(es->str, "\n");
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			if (fullsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+												 es->workers_state == NULL, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", true, es);
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..6c683538ff
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Suppose the input tuples are the following:
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them back together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
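+ *	As an illustration (not taken from the regression tests), a query
+ *	such as
+ *
+ *		SELECT * FROM tbl ORDER BY a, b LIMIT 10;
+ *
+ *	with a btree index on (a) can use an incremental sort on (a, b) with
+ *	"a" as the presorted prefix: only small groups of rows sharing the
+ *	same "a" value need to be sorted before rows can be returned.
+ *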
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting
+ *	    solely on the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information in either the local
+ * node's sort info or, for a parallel worker process, in the shared info
+ * (this avoids having to additionally memcpy the info from local memory to
+ * shared memory at each instrumentation call). This macro expands to choose
+ * the proper sort state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->ss.ps.instrument != NULL) \
+	{ \
+		if (node->shared_info && node->am_worker) \
+		{ \
+			Assert(IsParallelWorker()); \
+			Assert(ParallelWorkerNumber <= node->shared_info->num_workers); \
+			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+		} else { \
+			instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+		} \
+	}
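+
+/*
+ * Typical usage in this file (note the absence of a trailing semicolon,
+ * since the macro expands to a complete if statement):
+ *
+ *		INSTRUMENT_SORT_GROUP(node, fullsort)
+ *
+ * invoked immediately after tuplesort_performsort() on the corresponding
+ * sort state.
+ */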
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->presortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->presortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			presortedCols;
+
+	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change.  Therefore we do our comparison
+	 * starting from the last pre-sorted column to optimize for early
+	 * detection of inequality and to minimize the number of function calls.
+	 */
+	for (int i = presortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all tuples already fetched are part of a single
+ * prefix key group, we also have to handle the possibility that there is
+ * at least one different prefix key group before the large prefix key group.
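+ *
+ * For example (illustrative), if the presorted key is X and the full sort
+ * state holds tuples with X values 1, 1, 2, 2, ... while a long run of
+ * X = 2 continues in the outer node, the loop below first drains the X = 1
+ * group into the prefix sort state, sorts and returns it, and only on a
+ * later call moves on to the X = 2 group.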
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			presortedCols = plannode->presortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - presortedCols,
+												&(plannode->sort.sortColIdx[presortedCols]),
+												&(plannode->sort.sortOperators[presortedCols]),
+												&(plannode->sort.collations[presortedCols]),
+												&(plannode->sort.nullsFirst[presortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * its tuples out, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of those tuples
+		 * into the presorted prefix tuplesort. Now we can save our pivot
+		 * comparison tuple and continue fetching tuples from the outer
+		 * execution node to load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring the sort bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group, we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
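+
+/*
+ * Putting the two thresholds together (an illustrative walk-through with
+ * the default values above): the first 32 tuples are loaded into the full
+ * sort state without any prefix key checks; after that, each new tuple is
+ * compared to the pivot tuple, and if we pass 64 tuples without seeing the
+ * prefix keys change we assume we've hit a large group and switch to
+ * presorted prefix mode.
+ */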
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group, we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort)
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the
+			 * full sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that
+				 * it needs to move tuples from the full sort state to the
+				 * presorted prefix sort state.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring the sort bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore the user-specified scan direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we keep only one of
+	 * many sort batches in the current sort state at any time.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort state with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because we guard the setup of the
+	 * presorted-key comparison state by whether the full sort state exists,
+	 * nulling the pointers out could leak that state.  So just reset the
+	 * tuplesorts for reuse.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..e21f48327d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(presortedCols);
 
 	return newnode;
 }
@@ -4895,6 +4929,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..6c83372c9f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3783,6 +3799,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..c5bbbf459e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(presortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61d..a59926fa02 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines the cost of sorting a relation using tuplesort, not
+ *	  including the cost of reading the input data.  The results are
+ *	  returned via *startup_cost and *run_cost.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ *	Determines the cost of sorting a relation, including the cost of
+ *	reading the input data.  The results are returned via *startup_cost
+ *	and *run_cost.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
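+ *
+ * As a hypothetical illustration: for 10000 input tuples estimated to fall
+ * into 100 groups, we cost 100 tuplesorts of roughly 150 tuples each (the
+ * estimated group size is padded by 1.5, see below) instead of one sort of
+ * 10000 tuples; only the first group's sort is charged to startup cost,
+ * the rest to run cost.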
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and we rely on quite rough assumptions here.
+	 * Thus, we're pessimistic about incremental sort performance and
+	 * inflate the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..3b84feaf7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,49 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
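+ *
+ *    For example, with keys1 = (a, b, c) and keys2 = (a, b), *n_common is
+ *    set to 2 and false is returned (keys1 is not contained in keys2);
+ *    with keys1 = (a, b) and keys2 = (a, b, c), *n_common is also 2 and
+ *    true is returned.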
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1829,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * given pathkeys is potentially useful for improving the performance of the
+ * requested ordering.  Thus we return the number of leading keys shared by
+ * the list and the requested ordering, or 0 if none are useful.
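+ *
+ * For example, if the query requests ORDER BY a, b, c and the path is
+ * sorted by (a, b), the result is 2: an incremental sort of the remaining
+ * key c can reuse the (a, b) prefix.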
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..1d7d4eb3e7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int presortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Same as create_sort_plan, but creates an IncrementalSort plan instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->presortedCols = presortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an IncrementalSort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5da0528382..423ac25827 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4922,13 +4922,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4962,29 +4965,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
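+		/*
+		 * For example, with ORDER BY a, b and an input path sorted just by
+		 * (a), is_sorted comes back false with presorted_keys = 1, making
+		 * the path a candidate for the incremental sort branch below.
+		 */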
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e20c055dea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->presortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..b6ce724557 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..bc2c2dbb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the overhead
+ * of allocation might possibly be lowered.  However, we don't consider array
+ * sizes less than 1024.
+ *
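+ * As an illustration (assuming a typical 64-bit build where SortTuple is
+ * about 24 bytes and ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes), the
+ * second term comes out well under 1024, so the initial size is normally
+ * 1024 entries.
+ *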
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state is initialized, we
+	 * set up all of the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
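+ *
+ *	Disk usage is treated as more significant than memory usage (see the
+ *	comment below), so, for example, if one batch spilled to disk while
+ *	every other batch fit in memory, the disk figure is the one retained.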
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by some tuple set on disk
+	 * might be less than the amount occupied by the same tuple set in
+	 * memory due to the more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This avoids recreating tuplesort states (and thus saves
+ *	resources) when sorting multiple small batches.
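+ *
+ *	A sketch of the intended usage pattern when sorting several batches
+ *	with one state (these are all existing tuplesort entry points):
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch:
+ *			tuplesort_puttupleslot(state, slot);   (once per tuple)
+ *			tuplesort_performsort(state);
+ *			while (tuplesort_gettupleslot(state, ...))
+ *				(process tuple);
+ *			tuplesort_reset(state);
+ *		tuplesort_end(state);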
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..8d1b944472 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1980,6 +1980,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
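+ *	 For example, for ORDER BY a, b over input already sorted by a, there
+ *	 is one presorted key: a PresortedKeyData holding the comparison
+ *	 function info and attribute number for column a.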
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2008,6 +2023,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	Size		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+	INCSORT_READPREFIXSORT
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..28d580dd3c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1620,6 +1620,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..136d794219 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..96e970339c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
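+ *
+ * A sketch of the intended usage (names as in IncrementalSortGroupInfo
+ * from execnodes.h): a group info's sortMethods bitmask accumulates the
+ * methods observed across sort groups, e.g.
+ *		groupInfo->sortMethods |= stats.sortMethod;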
  */
 typedef enum
 {
 	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..9a9cb9f28c
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1400 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                        explain_analyze_without_memory                         
+-------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(10 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..4425853572 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v47-0006-nPresortedCols.patchtext/x-patch; charset=US-ASCII; name=v47-0006-nPresortedCols.patchDownload
From b21dd894a728565df7cbdd41c063c00a21fb5975 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 30 Mar 2020 18:51:53 -0400
Subject: [PATCH v47 6/6] nPresortedCols

---
 src/backend/commands/explain.c             |  2 +-
 src/backend/executor/nodeIncrementalSort.c | 22 +++++++++++-----------
 src/backend/nodes/copyfuncs.c              |  2 +-
 src/backend/nodes/outfuncs.c               |  2 +-
 src/backend/nodes/readfuncs.c              |  2 +-
 src/backend/optimizer/plan/createplan.c    | 16 ++++++++--------
 src/backend/optimizer/util/pathnode.c      |  2 +-
 src/include/nodes/pathnodes.h              |  2 +-
 src/include/nodes/plannodes.h              |  2 +-
 9 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cd2d81712f..02a81bebc3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2256,7 +2256,7 @@ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
 	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
-						 plan->sort.numCols, plan->presortedCols,
+						 plan->sort.numCols, plan->nPresortedCols,
 						 plan->sort.sortColIdx,
 						 plan->sort.sortOperators, plan->sort.collations,
 						 plan->sort.nullsFirst,
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 6c683538ff..bcab7c054c 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -159,11 +159,11 @@ preparePresortedCols(IncrementalSortState *node)
 	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
 
 	node->presorted_keys =
-		(PresortedKeyData *) palloc(plannode->presortedCols *
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
 									sizeof(PresortedKeyData));
 
 	/* Pre-cache comparison functions for each pre-sorted key. */
-	for (int i = 0; i < plannode->presortedCols; i++)
+	for (int i = 0; i < plannode->nPresortedCols; i++)
 	{
 		Oid			equalityOp,
 					equalityFunc;
@@ -204,9 +204,9 @@ preparePresortedCols(IncrementalSortState *node)
 static bool
 isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
 {
-	int			presortedCols;
+	int			nPresortedCols;
 
-	presortedCols = castNode(IncrementalSort, node->ss.ps.plan)->presortedCols;
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
 
 	/*
 	 * That the input is sorted by keys * (0, ... n) implies that the tail
@@ -214,7 +214,7 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
 	 * from the last pre-sorted column to optimize for early detection of
 	 * inequality and minimizing the number of function calls..
 	 */
-	for (int i = presortedCols - 1; i >= 0; i--)
+	for (int i = nPresortedCols - 1; i >= 0; i--)
 	{
 		Datum		datumA,
 					datumB,
@@ -295,18 +295,18 @@ switchToPresortedPrefixMode(PlanState *pstate)
 	if (node->prefixsort_state == NULL)
 	{
 		Tuplesortstate *prefixsort_state;
-		int			presortedCols = plannode->presortedCols;
+		int			nPresortedCols = plannode->nPresortedCols;
 
 		/*
 		 * Optimize the sort by assuming the prefix columns are all equal and
 		 * thus we only need to sort by any remaining columns.
 		 */
 		prefixsort_state = tuplesort_begin_heap(tupDesc,
-												plannode->sort.numCols - presortedCols,
-												&(plannode->sort.sortColIdx[presortedCols]),
-												&(plannode->sort.sortOperators[presortedCols]),
-												&(plannode->sort.collations[presortedCols]),
-												&(plannode->sort.nullsFirst[presortedCols]),
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
 												work_mem,
 												NULL,
 												false);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e21f48327d..f5977b0249 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -978,7 +978,7 @@ _copyIncrementalSort(const IncrementalSort *from)
 	/*
 	 * copy remainder of node
 	 */
-	COPY_SCALAR_FIELD(presortedCols);
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6c83372c9f..dfe5fb4867 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -863,7 +863,7 @@ _outIncrementalSort(StringInfo str, const IncrementalSort *node)
 
 	_outSortInfo(str, (const Sort *) node);
 
-	WRITE_INT_FIELD(presortedCols);
+	WRITE_INT_FIELD(nPresortedCols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index c5bbbf459e..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2190,7 +2190,7 @@ _readIncrementalSort(void)
 
 	ReadCommonSort(&local_node->sort);
 
-	READ_INT_FIELD(presortedCols);
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1d7d4eb3e7..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -247,7 +247,7 @@ static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
 static IncrementalSort *make_incrementalsort(Plan *lefttree,
-											 int numCols, int presortedCols,
+											 int numCols, int nPresortedCols,
 											 AttrNumber *sortColIdx, Oid *sortOperators,
 											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
@@ -265,7 +265,7 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
 static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
-														   List *pathkeys, Relids relids, int presortedCols);
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -2026,7 +2026,7 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 											  best_path->spath.path.pathkeys,
 											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
 											  best_path->spath.path.parent->relids : NULL,
-											  best_path->presortedCols);
+											  best_path->nPresortedCols);
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
@@ -5749,7 +5749,7 @@ make_sort(Plan *lefttree, int numCols,
  * nullsFirst arrays already.
  */
 static IncrementalSort *
-make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
 					 AttrNumber *sortColIdx, Oid *sortOperators,
 					 Oid *collations, bool *nullsFirst)
 {
@@ -5763,7 +5763,7 @@ make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
 	plan->righttree = NULL;
-	node->presortedCols = presortedCols;
+	node->nPresortedCols = nPresortedCols;
 	node->sort.numCols = numCols;
 	node->sort.sortColIdx = sortColIdx;
 	node->sort.sortOperators = sortOperators;
@@ -6126,11 +6126,11 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
  *	  'lefttree' is the node which yields input tuples
  *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
  *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
- *	  'presortedCols' is the number of presorted columns in input tuples
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
  */
 static IncrementalSort *
 make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
-								   Relids relids, int presortedCols)
+								   Relids relids, int nPresortedCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -6150,7 +6150,7 @@ make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
 								sortColIdx, sortOperators,
 								collations, nullsFirst);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e20c055dea..5e752f64b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2796,7 +2796,7 @@ create_incremental_sort_path(PlannerInfo *root,
 						  0.0,	/* XXX comparison_cost shouldn't be 0? */
 						  work_mem, limit_tuples);
 
-	sort->presortedCols = presorted_keys;
+	sort->nPresortedCols = presorted_keys;
 
 	return pathnode;
 }
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 28d580dd3c..a7cf733951 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1626,7 +1626,7 @@ typedef struct SortPath
 typedef struct IncrementalSortPath
 {
 	SortPath	spath;
-	int			presortedCols;	/* number of presorted columns */
+	int			nPresortedCols;	/* number of presorted columns */
 } IncrementalSortPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 136d794219..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -781,7 +781,7 @@ typedef struct Sort
 typedef struct IncrementalSort
 {
 	Sort		sort;
-	int			presortedCols;	/* number of presorted columns */
+	int			nPresortedCols;	/* number of presorted columns */
 } IncrementalSort;
 
 /* ---------------
-- 
2.17.1

#255Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#254)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 30, 2020 at 06:53:47PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 8:24 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:

On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate matches, and I've also combined the two patches
adding incremental sort paths to additional places in planner.

A couple more comments:

1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed compared to the other enable_* GUCs. I wonder if there's a
better place to move the details to. What about adding some examples
and explanation to perform.sgml?

I'll take a look at that and include it in a patch series tomorrow.

Attached.

2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
Output: a, b, c
Workers Planned: 2
Workers Launched: 2
-> Incremental Sort (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
Output: a, b, c
Sort Key: t.a, t.b, t.c
Presorted Key: t.a, t.b
Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
Worker 0: actual time=0.766..16122.368 rows=3841573 loops=1
Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
Worker 1: actual time=1.986..16189.831 rows=3845490 loops=1
Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
-> Parallel Index Scan using t_a_b_idx on public.t (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
Output: a, b, c
Worker 0: actual time=0.048..11275.178 rows=3841573 loops=1
Worker 1: actual time=0.041..11314.133 rows=3845490 loops=1
Planning Time: 0.166 ms
Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

Working on that too.

See attached. I've folded the original "explain fixes" patch into
the main series, and the "explain fixes" patch in this series contains
only the changes for the above.

Thanks. I'll take a look at those changes tomorrow.

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...

All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?

I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition into it. That makes the calls easier.

Ah, that actually fixed some of the compile warnings. The other is
fixed in my explain fixes patch.
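
(To sketch what that tweak amounts to: the instrumentation guard moves
inside the macro, so every call site shrinks to a single line. The helper
and field names below are assumptions for illustration, not quotes from
the patch:

#define INSTRUMENT_SORT_GROUP(node, groupName) \
	do { \
		/* only collect group stats when instrumentation is active */ \
		if ((node)->ss.ps.instrument != NULL) \
			instrumentSortedGroup(&(node)->incsort_info.groupName##GroupInfo, \
								  (node)->groupName##_state); \
	} while (0)

after which callers simply write INSTRUMENT_SORT_GROUP(node, fullsort);
with no surrounding if.)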

3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

I can fix this too.

Changed everywhere we used this var name. I'm tempted to change to
nPresortedKeys, but a cursory glance suggests some cases might
actually be consistent with other var names that reference columns, so
I'm not sure if we want to go down that path (and change more than
just this).

Not sure. We use "sort keys" and "path keys" for this, but I think
"columns" is good enough.

The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

create table t (a int, b int, c int, d int, e int, f int, g int);

insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

   master       off        on
   ----------------------------
   34.609    37.463    37.729

which means about an 8-9% regression with incremental sort. Of course,
this is only for planning time; for execution the impact is going to be
much smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - the add_partial_path
changes have a very small impact (~1%, which might be noise). The
regression comes mostly from the 0002 part that adds incremental sort.
At least in this particular test - different tests might behave
differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating a higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

if (keys1 == keys2)
return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

if (keys1 == keys2)
{
*n_common = list_length(keys1);
return true;
}

Not sure where the rest of the regression comes from yet.

Also, while looking at pathkeys_common_contained_in(), I've been a bit
puzzled about why this is correct:

return (key1 == NULL);

It's easy not to notice it's key1 and not keys1, so I suggest we add a
comment noting that 'key1 == NULL' means we've processed the whole keys1
list.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

build-indexes.py (text/plain; charset=us-ascii)
run-test.py (text/plain; charset=us-ascii)
#256Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: James Coleman (#254)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-30, James Coleman wrote:

+/* ----------------
+ *	 Instruementation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	Size		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;

There's a typo "Instruementation" in the comment, but I'm more surprised
that type Size is being used to store a bitmask. It looks weird to me.
Wouldn't it be more reasonable to use bits32 or some such? (I first
noticed this in the "sizeof(Size)" code that appears in the explain
code.)
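
(With that change, the struct would presumably end up as below — same
fields as quoted above, only the bitmask type swapped; bits32 is just a
typedef for uint32:)

typedef struct IncrementalSortGroupInfo
{
	int64		groupCount;
	long		maxDiskSpaceUsed;
	long		totalDiskSpaceUsed;
	long		maxMemorySpaceUsed;
	long		totalMemorySpaceUsed;
	bits32		sortMethods;	/* bitmask of TuplesortMethod */
} IncrementalSortGroupInfo;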

OTOH, aesthetically it would seem better to define these values using
increasing shifts of one (1 << 1 and so on), rather than literal powers
of two:

+ * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
+ * instrumentation so needs to have each value be a separate bit.
*/
typedef enum
{
SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
} TuplesortMethod;

I don't quite understand why you skipped "1". (Also, is the use of zero
a wise choice?)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#257James Coleman
jtc331@gmail.com
In reply to: Alvaro Herrera (#256)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 12:31 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Mar-30, James Coleman wrote:

+/* ----------------
+ *    Instruementation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+     int64           groupCount;
+     long            maxDiskSpaceUsed;
+     long            totalDiskSpaceUsed;
+     long            maxMemorySpaceUsed;
+     long            totalMemorySpaceUsed;
+     Size            sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;

There's a typo "Instruementation" in the comment, but I'm more surprised
that type Size is being used to store a bitmask. It looks weird to me.
Wouldn't it be more reasonable to use bits32 or some such? (I first
noticed this in the "sizeof(Size)" code that appears in the explain
code.)

I just didn't know about bits32; I'll change it.

OTOH, aesthetically it would seem better to define these values using
increasing shifts of one (1 << 1 and so on), rather than literal powers
of two:

+ * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
+ * instrumentation so needs to have each value be a separate bit.
*/
typedef enum
{
SORT_TYPE_STILL_IN_PROGRESS = 0,
-     SORT_TYPE_TOP_N_HEAPSORT,
-     SORT_TYPE_QUICKSORT,
-     SORT_TYPE_EXTERNAL_SORT,
-     SORT_TYPE_EXTERNAL_MERGE
+     SORT_TYPE_TOP_N_HEAPSORT = 2,
+     SORT_TYPE_QUICKSORT = 4,
+     SORT_TYPE_EXTERNAL_SORT = 8,
+     SORT_TYPE_EXTERNAL_MERGE = 16
} TuplesortMethod;

I don't quite understand why you skipped "1". (Also, is the use of zero
a wise choice?)

The assignment of 0 was already there, and there wasn't a comment to
indicate why. That ends up meaning we wouldn't display "still in
progress" as a type here, which is maybe desirable, but I'm honestly
not sure why it was that way originally. I'm curious if you have any
thoughts on it.

I knew some projects used increasing shifts, but I wasn't sure what
the preference was here. I'll correct it.

James

#258Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#257)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

+ * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
+ * instrumentation so needs to have each value be a separate bit.

I don't quite understand why you skipped "1". (Also, is the use of zero
a wise choice?)

The assignment of 0 was already there, and there wasn't a comment to
indicate why. That ends up meaning we wouldn't display "still in
progress" as a type here, which is maybe desirable, but I'm honestly
not sure why it was that way originally. I'm curious if you have any
thoughts on it.

As things stood, the "= 0" was a no-op, since the first enum value
would've been that anyway. But if you're converting this set of symbols
to bits that can be OR'd together, it seems pretty strange to use zero,
because that can't be distinguished from "absence of any entry".

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

regards, tom lane

#259James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#258)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

James Coleman <jtc331@gmail.com> writes:

+ * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
+ * instrumentation so needs to have each value be a separate bit.

I don't quite understand why you skipped "1". (Also, is the use of zero
a wise choice?)

The assignment of 0 was already there, and there wasn't a comment to
indicate why. That ends up meaning we wouldn't display "still in
progress" as a type here, which is maybe desirable, but I'm honestly
not sure why it was that way originally. I'm curious if you have any
thoughts on it.

As things stood, the "= 0" was a no-op, since the first enum value
would've been that anyway. But if you're converting this set of symbols
to bits that can be OR'd together, it seems pretty strange to use zero,
because that can't be distinguished from "absence of any entry".

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

Right, I didn't see the explicit "= 0" in other enums there, so it
made me wonder if it was intentional to designate that one had to be
0, but I guess without a comment that's a lot of inference.

The semantics seemed somewhat useful here in theory, but since I'm not
hearing a "yeah, that was intentional but not commented", I'm just
going to change it to what you'd naturally expect.
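
(Presumably something along these lines — a sketch, not necessarily the
exact final form, with "still in progress" left at zero since it is never
OR'd into the bitmask:)

typedef enum
{
	SORT_TYPE_STILL_IN_PROGRESS = 0,
	SORT_TYPE_TOP_N_HEAPSORT = 1 << 0,
	SORT_TYPE_QUICKSORT = 1 << 1,
	SORT_TYPE_EXTERNAL_SORT = 1 << 2,
	SORT_TYPE_EXTERNAL_MERGE = 1 << 3
} TuplesortMethod;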

James

#260Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#259)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

Right, I didn't see the explicit "= 0" in other enums there, so it
made me wonder if it was intentional to designate that one had to be
0, but I guess without a comment that's a lot of inference.

It's possible that somebody meant that as an indicator that the code
depends on palloc0() leaving the field with that value. But if so,
you'd soon find that out ... and an actual comment would be better,
anyway.

regards, tom lane

#261James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#255)
7 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

create table t (a int, b int, c int, d int, e int, f int, g int);

insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

master      off       on
--------------------------
34.609   37.463   37.729

which means about an 8-9% regression with incremental sort. Of course,
this is only for planning time; for execution the impact is going to be
much smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating a higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

if (keys1 == keys2)
return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

if (keys1 == keys2)
{
*n_common = list_length(keys1);
return true;
}

Not sure where the rest of the regression comes from yet.

I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.

I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes I was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report though.

This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?

Also, while looking at pathkeys_common_contained_in(), I've been a bit
puzzled about why this is correct:

return (key1 == NULL);

It's easy not to notice it's key1 and not keys1, so I suggest we add a
comment noting that 'key1 == NULL' means we've processed the whole keys1
list.

Done.
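
For reference, a rough sketch of pathkeys_common_contained_in() with both
fast paths and that comment folded in, modeled on the shape of
compare_pathkeys; the exact wording in the patch may differ:

bool
pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
{
	int			n = 0;
	ListCell   *key1,
			   *key2;

	/* Fast path: identical lists are trivially equal. */
	if (keys1 == keys2)
	{
		*n_common = list_length(keys1);
		return true;
	}

	/* Fast paths: an empty keys1 is a prefix of anything ... */
	if (keys1 == NIL)
	{
		*n_common = 0;
		return true;
	}
	/* ... while a non-empty keys1 cannot be contained in an empty keys2. */
	if (keys2 == NIL)
	{
		*n_common = 0;
		return false;
	}

	/* Count the length of the common prefix of the two lists. */
	forboth(key1, keys1, key2, keys2)
	{
		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
		PathKey    *pathkey2 = (PathKey *) lfirst(key2);

		if (pathkey1 != pathkey2)
		{
			*n_common = n;
			return false;
		}
		n++;
	}

	/* 'key1 == NULL' means we've processed the whole keys1 list. */
	*n_common = n;
	return (key1 == NULL);
}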

I've included fixes for Alvaro's comments in this patch series also.

James

Attachments:

v48-0003-Consider-incremental-sort-paths-in-additional-pl.patch (text/x-patch; charset=US-ASCII)
From 04e29add36090528f194c4b25f3326f6688f1d38 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v48 3/7] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 346 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 580 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			pathkeys = lappend(pathkeys, pathkey);
+		}
+
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4d7a68d051..73b7782dcb 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5079,6 +5079,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6433,7 +6494,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6492,6 +6555,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6503,12 +6640,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6539,6 +6682,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6810,6 +7003,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6818,7 +7063,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6853,6 +7100,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6950,10 +7247,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6979,6 +7277,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7080,7 +7418,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7234,7 +7572,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v48-0005-typo.patch (text/x-patch; charset=US-ASCII)
From a65b9b5e033a815f46a672c41ecb352b7b44c3a4 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 17:17:34 +0000
Subject: [PATCH v48 5/7] typo

---
 src/include/nodes/execnodes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index adc4e24982..05c03a8fde 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2026,7 +2026,7 @@ typedef struct SortState
 } SortState;
 
 /* ----------------
- *	 Instruementation information for IncrementalSort
+ *	 Instrumentation information for IncrementalSort
  * ----------------
  */
 typedef struct IncrementalSortGroupInfo
-- 
2.17.1

v48-0001-Consider-low-startup-cost-when-adding-partial-pa.patch (text/x-patch; charset=US-ASCII)
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v48 1/7] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v48-0002-Implement-incremental-sort.patch (text/x-patch; charset=US-ASCII)
From 9af0c6143e8eaa7e89837fb578e99f097e63c2d3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v48 2/7] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |   14 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4192 insertions(+), 165 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..583906d1bd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
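+ *
+ * As a rough sketch (values assumed, not actual output), the text-format
+ * summary emitted below looks like:
+ *   Full-sort Groups: 2 Sort Method: quicksort Memory: avg=26kB peak=26kB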
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(Size); ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then
+			 * exclude it from the output since it either didn't launch or
+			 * didn't contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information in either local node's sort
+ * info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
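+ *
+ * For example, INSTRUMENT_SORT_GROUP(node, fullsort) records the stats of
+ * the just-sorted batch in node->fullsort_state into the fullsortGroupInfo
+ * of either the local node or, in a parallel worker, that worker's slot in
+ * the shared info.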
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	do { \
+		if ((node)->ss.ps.instrument != NULL) \
+		{ \
+			if ((node)->shared_info && (node)->am_worker) \
+			{ \
+				Assert(IsParallelWorker()); \
+				Assert(ParallelWorkerNumber < (node)->shared_info->num_workers); \
+				instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+			else \
+			{ \
+				instrumentSortedGroup(&(node)->incsort_info.groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+		} \
+	} while (0)
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
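+ *
+ * For example (an illustrative case, not tied to any particular plan): for
+ * an int4 key ordered by the btree "<" operator, the derived equality
+ * operator is "=" and its implementation function is int4eq; we look both
+ * up once here so isCurrentGroup() can call the function directly.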
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0 ... n), the tail keys are more
+	 * likely to change.  Therefore we do our comparison starting from the
+	 * last pre-sorted column to optimize for early detection of inequality
+	 * and to minimize the number of function calls.
+	 */
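+	/*
+	 * For example (reusing the data from the file header): with presorted
+	 * key X and pivot (2, 9), the tuples (2, 1) and (2, 5) compare equal on
+	 * X and stay in the group, while (3, 3) differs and starts a new group.
+	 */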
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all tuples already fetched are part of a single
+ * prefix group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
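+ *
+ * For example (an assumed scenario), with prefix key X the full sort
+ * tuplesort might hold one tuple with X = 1 followed by many tuples with
+ * X = 2: the lone X = 1 group must be sorted and returned on its own
+ * before the X = 2 run can be transferred into the prefix-only tuplesort.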
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * its tuples out, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch
+		 * are in the same prefix key group and moved all of them into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
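+
+/*
+ * For example (assuming the defaults above and no bound): we accumulate at
+ * least 32 tuples before checking prefix keys at all; if we then pass 64
+ * tuples without seeing the prefix change, we presume we've hit a large
+ * group and switch to presorted prefix mode.  With a LIMIT of 10, the
+ * minimum group size drops to 10 instead.
+ */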
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
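+ *
+ *		The node moves through four execution states: INCSORT_LOADFULLSORT
+ *		(accumulate a minimum-sized batch sorted on all keys),
+ *		INCSORT_READFULLSORT (return that batch), and, once a large prefix
+ *		key group is detected, INCSORT_LOADPREFIXSORT and
+ *		INCSORT_READPREFIXSORT (load and return tuples sorted only on the
+ *		non-presorted suffix keys).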
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
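+		 * For example (an assumed case), under LIMIT 10 with no tuples
+		 * returned yet, currentBound is 10; that's below
+		 * DEFAULT_MIN_GROUP_SIZE, so we bound the full sort to 10 tuples and
+		 * lower minGroupSize to 10 as well.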
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the
+			 * full sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" sort "below" the retained
+				 * ones, and we're already contractually guaranteed not to
+				 * need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that
+				 * it needs to move tuples from the full sort to the presorted
+				 * prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to  sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_BACKWARD or
+	 * EXEC_FLAG_MARK, because we only keep one of many sort batches in the
+	 * current sort state at any given time.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost.  Nulling the pointers after a mere
+	 * reset would also leak the still-allocated tuplesorts and cause the
+	 * presorted-key comparison state to be rebuilt (and its old allocation
+	 * leaked) on the next pass, so we simply reset and keep them.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Read the fields common to all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
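+ *
+ * As a hypothetical illustration: with 1,000,000 input tuples estimated to
+ * form 10,000 groups on the presorted keys, each group holds ~100 tuples and
+ * we cost a single tuplesort of 1.5 * 100 tuples; that group startup cost is
+ * charged once in the startup cost and 9,999 more times in the run cost,
+ * while the group run cost is charged for all 10,000 groups.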
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and we rely on quite rough assumptions here.
+	 * Thus, we are pessimistic about incremental sort performance and
+	 * inflate the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * Once we have started producing tuples from the first group, the cost
+	 * of producing all the remaining tuples is given by the cost to finish
+	 * processing this group, plus the total cost to process the remaining
+	 * groups, plus the remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
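+	/*
+	 * Taken together, the arithmetic above gives
+	 *
+	 *   total_cost = input_total_cost
+	 *              + input_groups * (group_startup_cost + group_run_cost)
+	 *              + the per-tuple and per-group overheads added below
+	 */
+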
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..3b84feaf7b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,49 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
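+ *
+ *    For example, if keys1 = (a, b) and keys2 = (a, c), this sets
+ *    *n_common = 1 and returns false; if keys1 = (a) and keys2 = (a, b),
+ *    it sets *n_common = 1 and returns true.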
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1829,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving the
+ * performance of the requested ordering.  Thus we return the number of
+ * leading keys shared by the list and the requested ordering, or 0 if no
+ * useful keys are found (e.g., for a requested ordering (a, b) and
+ * pathkeys (a, c), we return 1).
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..4d7a68d051 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..5e752f64b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the
+ * allocation overhead might possibly be lowered.  However, we don't consider
+ * array sizes less than 1024.  (Given the usual 8 kB
+ * ALLOCSET_SEPARATE_THRESHOLD, this normally evaluates to 1024.)
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied across
+								 * all sorted groups, either in-memory or
+								 * on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of the
+	 * state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
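+
+	/*
+	 * If a previous batch grew the memtuples array, release it so that each
+	 * batch starts over from the initial array size; otherwise reuse (or
+	 * allocate) an array of the initial size.
+	 */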
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it cannot fit the data in main
+	 * memory.  This is why we consider space used on disk more important for
+	 * tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a tuple set on disk might be less than the
+	 * amount occupied by the same tuple set in memory, due to a more compact
+	 * representation.
+	 */
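+
+	/*
+	 * For example (illustration only): once a batch has spilled to disk, a
+	 * later in-memory batch can never override the recorded maximum, however
+	 * much memory it used.
+	 */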
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
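+ *
+ *	A typical calling pattern (a sketch of batch-at-a-time use, as in
+ *	incremental sort) is: load a batch with tuplesort_puttupleslot(), call
+ *	tuplesort_performsort(), drain it with tuplesort_gettupleslot(), then
+ *	call tuplesort_reset() and repeat; finish with tuplesort_end().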
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-initialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..adc4e24982 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;		/* number of groups sorted */
+	long		maxDiskSpaceUsed;	/* maximum disk space used by any group */
+	long		totalDiskSpaceUsed; /* disk space summed across groups */
+	long		maxMemorySpaceUsed; /* maximum memory used by any group */
+	long		totalMemorySpaceUsed;	/* memory summed across groups */
+	Size		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,		/* loading tuples into the full sort state */
+	INCSORT_LOADPREFIXSORT,		/* loading tuples into the prefix sort state */
+	INCSORT_READFULLSORT,		/* reading tuples from the full sort state */
+	INCSORT_READPREFIXSORT		/* reading tuples from the prefix sort state */
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;	/* tuples remaining in fullsort_state */
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple; /* slot for moving tuples between sorts */
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..96e970339c 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
 	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_TOP_N_HEAPSORT = 2,
+	SORT_TYPE_QUICKSORT = 4,
+	SORT_TYPE_EXTERNAL_SORT = 8,
+	SORT_TYPE_EXTERNAL_MERGE = 16
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v48-0004-optimization.patchtext/x-patch; charset=US-ASCII; name=v48-0004-optimization.patchDownload
From 6499501194b6e96782edd97805571dd390c0badc Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 08:27:20 -0400
Subject: [PATCH v48 4/7] optimization

---
 src/backend/optimizer/path/pathkeys.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 3b84feaf7b..71fb790d35 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -346,6 +346,30 @@ pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
 	ListCell   *key1,
 			   *key2;
 
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
 	forboth(key1, keys1, key2, keys2)
 	{
 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
@@ -359,6 +383,7 @@ pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
 		n++;
 	}
 
+	/* 'key1 == NULL' means we've processed the whole keys1 list. */
 	*n_common = n;
 	return (key1 == NULL);
 }
-- 
2.17.1

v48-0006-enum-bitmask-style.patchtext/x-patch; charset=US-ASCII; name=v48-0006-enum-bitmask-style.patchDownload
From 98dbbbb5ee127459df82e13be8ab17a47a66d4b0 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 17:17:43 +0000
Subject: [PATCH v48 6/7] enum bitmask style

---
 src/include/utils/tuplesort.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 96e970339c..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -67,11 +67,11 @@ typedef struct SortCoordinateData *SortCoordinate;
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT = 2,
-	SORT_TYPE_QUICKSORT = 4,
-	SORT_TYPE_EXTERNAL_SORT = 8,
-	SORT_TYPE_EXTERNAL_MERGE = 16
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
-- 
2.17.1

v48-0007-bits32.patchtext/x-patch; charset=US-ASCII; name=v48-0007-bits32.patchDownload
From 0d634064a3a3811a46c58053932c94b3c1d41f50 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 14:06:11 -0400
Subject: [PATCH v48 7/7] bits32

---
 src/backend/commands/explain.c | 2 +-
 src/include/nodes/execnodes.h  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 583906d1bd..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2719,7 +2719,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	List	   *methodNames = NIL;
 
 	/* Generate a list of sort methods used across all groups. */
-	for (int bit = 0; bit < sizeof(Size); ++bit)
+	for (int bit = 0; bit < sizeof(bits32) * BITS_PER_BYTE; ++bit)
 	{
 		if (groupInfo->sortMethods & (1 << bit))
 		{
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 05c03a8fde..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2036,7 +2036,7 @@ typedef struct IncrementalSortGroupInfo
 	long		totalDiskSpaceUsed;
 	long		maxMemorySpaceUsed;
 	long		totalMemorySpaceUsed;
-	Size		sortMethods; /* bitmask of TuplesortMethod */
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
 } IncrementalSortGroupInfo;
 
 typedef struct IncrementalSortInfo
-- 
2.17.1
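A note on the two patches above: the standalone sketch below shows how the
bitmask-style TuplesortMethod values are meant to be accumulated per sort
group and then reported. The enum values are copied from v48-0006; the
bits32 stand-in, the method names, and the demo loop are illustrative
assumptions rather than the actual explain.c code.

#include <stdio.h>

typedef unsigned int bits32;	/* stand-in for the typedef in c.h */

typedef enum
{
	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
	SORT_TYPE_QUICKSORT = 1 << 2,
	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
} TuplesortMethod;

static const char *
method_name(TuplesortMethod m)
{
	switch (m)
	{
		case SORT_TYPE_TOP_N_HEAPSORT:
			return "top-N heapsort";
		case SORT_TYPE_QUICKSORT:
			return "quicksort";
		case SORT_TYPE_EXTERNAL_SORT:
			return "external sort";
		case SORT_TYPE_EXTERNAL_MERGE:
			return "external merge";
		default:
			return "still in progress";
	}
}

int
main(void)
{
	bits32		sortMethods = 0;

	/* Each sort group ORs the method it ended up using into the mask. */
	sortMethods |= SORT_TYPE_TOP_N_HEAPSORT;
	sortMethods |= SORT_TYPE_QUICKSORT;

	/*
	 * Walk every bit of the mask.  The bound must be the width in bits
	 * (8 * sizeof), not sizeof alone, or high bits such as the
	 * external-merge flag would never be inspected.
	 */
	for (int bit = 0; bit < 8 * (int) sizeof(bits32); bit++)
	{
		if (sortMethods & (1u << bit))
			printf("used: %s\n", method_name((TuplesortMethod) (1u << bit)));
	}
	return 0;
}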

#262Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#261)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

create table t (a int, b int, c int, d int, e int, f int, g int);

insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

  master       off        on
  ---------------------------
  34.609    37.463    37.729

which means about 8-9% regression with incremental sort. Of course, this
is only for planning time, for execution the impact is going to be much
smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

if (keys1 == keys2)
return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

if (keys1 == keys2)
{
*n_common = list_length(keys1);
return true;
}

Not sure where the rest of the regression comes from yet.

I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.

Which other function? I don't see this optimization in compare_pathkeys,
that only checks key1/key2 after the forboth loop, but that's something
different.

I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...
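To make those semantics concrete, here is a standalone model of what
pathkeys_common_contained_in() computes; the array representation and all
names are invented for illustration (the real function walks PathKey lists
with forboth()):

#include <stdio.h>
#include <stdbool.h>

static bool
common_contained_in(const int *keys1, int len1,
                    const int *keys2, int len2, int *n_common)
{
	int			n = 0;

	/* Fast paths, mirroring the checks discussed above. */
	if (keys1 == keys2 && len1 == len2)
	{
		*n_common = len1;		/* same list: everything is common */
		return true;
	}
	if (len1 == 0)
	{
		*n_common = 0;
		return true;			/* an empty list is contained in anything */
	}
	if (len2 == 0)
	{
		*n_common = 0;
		return false;
	}

	/* Walk both lists in lockstep, as forboth() does. */
	while (n < len1 && n < len2 && keys1[n] == keys2[n])
		n++;

	*n_common = n;
	return (n == len1);			/* analogue of "return (key1 == NULL)" */
}

int
main(void)
{
	int			a[] = {1, 2, 3};
	int			b[] = {1, 2, 3, 4};
	int			n_common;
	bool		contained = common_contained_in(a, 3, b, 4, &n_common);

	printf("contained=%d n_common=%d\n", contained, n_common);
	return 0;					/* prints contained=1 n_common=3 */
}

The n == len1 test plays the role of the real function's return
(key1 == NULL): the loop stops either because a key differed or because one
list ran out, and only fully consuming keys1 means it is contained in keys2.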

I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report though.

Yeah, I don't think there's anything obviously more expensive.

This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?

Yes, more or less. It was definitely designed to do that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead will become negligible. Same for reducing the number of indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to assume we can make the planner
smarter with no extra overhead spent on planning, and that we can never
accept patches adding even tiny overhead. With that approach we'd
probably end up with a trivial planner that generates just a single
query plan, because that's going to be the fastest planner. A realistic
approach needs to consider both the planning and execution phases, and
the benefits of this patch seem clear - if you have queries that do
benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.

Also, while looking at pathkeys_common_contained_in(), I've been a bit
puzzled why this is correct:

return (key1 == NULL);

It's easy to not notice it's key1 and not keys1, so I suggest we add a
comment saying that 'key1 == NULL' means we've processed the whole keys1 list.

Done.

I've included fixes for Alvaro's comments in this patch series also.

OK, thanks.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#263James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#262)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

create table t (a int, b int, c int, d int, e int, f int, g int);

insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

  master       off        on
  ---------------------------
  34.609    37.463    37.729

which means about 8-9% regression with incremental sort. Of course, this
is only for planning time, for execution the impact is going to be much
smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating a higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

if (keys1 == keys2)
return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

if (keys1 == keys2)
{
*n_common = list_length(keys1);
return true;
}

Not sure where the rest of the regression comes from yet.

I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.

Which other function? I don't see this optimization in compare_pathkeys,
that only checks key1/key2 after the forboth loop, but that's something
different.

pathkeys_useful_for_ordering checks both inputs.

I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...

I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report though.

Yeah, I don't think there's anything obviously more expensive.

This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?

Yes, more or less. It was definitely designed to do that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead will get negligible. Same for reducing number of indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.

Any thoughts you have already on an approach for this?

James

#264Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#262)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

regards, tom lane

#265Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#263)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 06:00:00PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

create table t (a int, b int, c int, d int, e int, f int, g int);

insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

master      off         on
--------------------------
34.609      37.463      37.729

which means about 8-9% regression with incremental sort. Of course, this
is only for planning time, for execution the impact is going to be much
smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating a higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

if (keys1 == keys2)
    return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

if (keys1 == keys2)
{
    *n_common = list_length(keys1);
    return true;
}

Not sure where the rest of the regression comes from yet.

I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.

Which other function? I don't see this optimization in compare_pathkeys,
which only checks key1/key2 after the forboth loop, but that's something
different.

pathkeys_useful_for_ordering checks both inputs.

I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...

I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report though.

Yeah, I don't think there's anything obviously more expensive.

This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?

Yes, more or less. It was definitely designed to do that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead will get negligible. Same for reducing number of indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.

Any thoughts you have already on an approach for this?

That very much depends on what may be causing the problem. I have two
hypotheses, at the moment.

Based on the profiles I've seen so far, there does not seem to be any
function that suddenly got slower. That probably implies we're simply
generating more paths than before, which means more allocations, more
add_path calls etc. It's not clear to me why this would happen even with
enable_incrementalsort=off, though.

Or maybe some of the structs got larger and need more cachelines? That
would affect performance even with the GUC set to off. But the perf stat
data also don't show anything particularly revealing. I'm using this:

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,\
LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches,\
cycles,instructions,cache-references,cache-misses,\
bus-cycles,raw_syscalls:sys_enter -p $PID

An example output (for master and patched branch) is attached, but I
don't see anything obviously worse (there is some variance, of course).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

perfstat.txttext/plain; charset=us-asciiDownload
#266Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#264)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#267James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#266)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:
1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

James

#268Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#267)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.
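
As a sketch of what that two-phase pattern could look like here - the
estimate_incremental_sort_cost() helper is hypothetical, invented just
for illustration, while create_incremental_sort_path() is roughly the
patch's path constructor:

Cost        startup_cost;
Cost        total_cost;

/* hypothetical helper: cheap estimate first, analogous to
 * initial_cost_nestloop() */
estimate_incremental_sort_cost(root, input_path, presorted_keys,
                               &startup_cost, &total_cost);

/* only pay for building the path when the estimate looks competitive */
if (total_cost < rel->cheapest_total_path->total_cost)
    add_path(rel, (Path *)
             create_incremental_sort_path(root, rel, input_path,
                                          root->sort_pathkeys,
                                          presorted_keys,
                                          limit_tuples));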

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#269James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#268)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys is only length 1, then we could skip the additional
path creation.

James

#270Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#260)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Mar-31, Tom Lane wrote:

James Coleman <jtc331@gmail.com> writes:

On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

Right, I didn't see the explicit "= 0" in other enums there, so it
made me wonder if it was intentional to designate that one had to be
0, but I guess without a comment that's a lot of inference.

It's possible that somebody meant that as an indicator that the code
depends on palloc0() leaving the field with that value. But if so,
you'd soon find that out ... and an actual comment would be better,
anyway.

git blame fingers this:

commit bf11e7ee2e3607bb67d25aec73aa53b2d7e9961b
Author: Robert Haas <rhaas@postgresql.org>
AuthorDate: Tue Aug 29 13:22:49 2017 -0400
CommitDate: Tue Aug 29 13:26:33 2017 -0400

Propagate sort instrumentation from workers back to leader.

Up until now, when parallel query was used, no details about the
sort method or space used by the workers were available; details
were shown only for any sorting done by the leader. Fix that.

Commit 1177ab1dabf72bafee8f19d904cee3a299f25892 forced the test case
added by commit 1f6d515a67ec98194c23a5db25660856c9aab944 to run
without parallelism; now that we have this infrastructure, allow
that again, with a little tweaking to make it pass with and without
force_parallel_mode.

Robert Haas and Tom Lane

Discussion: /messages/by-id/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com

I looked at the discussion thread. That patch was first posted by
Robert at
/messages/by-id/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com
without the "= 0" part; later Tom posted v2 here
/messages/by-id/11223.1503695532@sss.pgh.pa.us
containing the "= 0", but I see no actual discussion of that point.

I suppose it could also be important to clarify that it's 0 if it were
used as an array index of some sort, but I don't see that in 2017's
commit.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#271Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#269)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

:-(

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without add_partial_path and
with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.
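
To illustrate the kind of GUC dependence being debated, the retention
logic could hypothetically be gated like this - a sketch of the idea
only, not the patch's actual add_partial_path() change:

/*
 * Keep a partial path that loses on total cost but wins on startup cost
 * only when incremental sort could actually exploit it later.
 * (accept_new stands for add_partial_path's keep-the-new-path decision.)
 */
if (enable_incrementalsort &&
    new_path->startup_cost < old_path->startup_cost &&
    compare_pathkeys(new_path->pathkeys,
                     old_path->pathkeys) != PATHKEYS_BETTER2)
    accept_new = true;      /* worth keeping for its cheap startup */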

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#272James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#271)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

Right. I'm just grasping at straws on that.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

:-(

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without add_partial_path and
with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.

That's a good point.

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

James

#273Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#272)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

Right. I'm just grasping at straws on that.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

:-(

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

WFM

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without add_partial_path and
with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.

That's a good point.

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

Well, my point is that create_ordered_paths() looks like this:

is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

if (is_sorted)
{
    ... old code
}
else
{
    if (input_path == cheapest_input_path)
    {
        ... old code
    }

    /* With incremental sort disabled, don't build those paths. */
    if (!enable_incrementalsort)
        continue;

    /* Likewise, if the path can't be used for incremental sort. */
    if (!presorted_keys)
        continue;

    ... incremental sort path
}

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch. Or it's not sorted at all, so presorted_keys==0 and we
never get to the incremental path.

Or did you mean to use the optimization somewhere else?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#274James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#273)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly is the expensive part. For example if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick/cheap check if the path can even be useful by estimating the
cost and only then building the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We have run that code even without incremental sort,
but it's changed from master.

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

Right. I'm just grasping at straws on that.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

:-(

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

WFM

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of it all the time, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without add_partial_path and
with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.

That's a good point.

I've described the idea about something like initial_cost_nestloop and
so on. But I'm a bit skeptical about it, considering that the GUC only
has limited effect.

A somewhat related note is that the number of indexes has a pretty significant
impact on planning time, even on master. This is timing of the same
explain script (similar to the one shown before) with different number
of indexes on master:

0 indexes    7 indexes    49 indexes
-------------------------------------------
    10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before and might otherwise require a separate
index. So incremental sort might easily reduce the number of indexes
needed, compensating for the overhead we're discussing here. Of course,
that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

Well, my point is that create_ordered_paths() looks like this:

is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

if (is_sorted)
{
    ... old code
}
else
{
    if (input_path == cheapest_input_path)
    {
        ... old code
    }

    /* With incremental sort disabled, don't build those paths. */
    if (!enable_incrementalsort)
        continue;

    /* Likewise, if the path can't be used for incremental sort. */
    if (!presorted_keys)
        continue;

    ... incremental sort path
}

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch. Or it's not sorted at all, so presorted_keys==0 and we
never get to the incremental path.

Or did you mean to use the optimization somewhere else?

Hmm, yes, I didn't think through that properly. I'll have to look at
the other cases to confirm the same logic applies there.

One other thing: in the code above we create the regular sort path
inside of `if (input_path == cheapest_input_path)`, but incremental
sort is outside of that condition. I'm not sure I'm remembering why
that was, and it's not obvious to me reading it right now (though it's
getting late here, so maybe I'm just not thinking clearly). Do you
happen to remember why that is?

I've included the optimization on top of the add_partial_path fix and I now
have numbers (for your test, slightly modified in how I execute it)
like:

branch: 0.8354718927735362
master: 0.8128127066707269

Which is a 2.7% regression (with enable_incrementalsort off).

James

#275Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#274)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

In general, I think it'd be naive to expect that we can make the planner smarter with
no extra overhead spent on planning, and that we can never accept patches
adding even tiny overhead. With that approach we'd probably end up with
a trivial planner that generates just a single query plan, because
that's going to be the fastest planner. A realistic approach needs to
consider both the planning and execution phase, and benefits of this
patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though. The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving. The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

I think it depends on what exactly the expensive part is. For example, if
it's the construction of IncrementalSort paths, then maybe we could try to
do a quick check of whether the path can even be useful by estimating the
cost, and only then build the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough we do the
more expensive stuff.
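
Something like this, sketched (initial_cost_incrementalsort() is
hypothetical here; add_path_precheck() is the existing helper, and the
continue mirrors the loop in the create_ordered_paths sketch above):

    Cost        startup_cost;
    Cost        total_cost;

    /* cheap, conservative cost estimate first */
    initial_cost_incrementalsort(root, &startup_cost, &total_cost,
                                 subpath, useful_pathkeys, presorted_keys);

    /* only build the real path if it could survive add_path() */
    if (!add_path_precheck(rel, startup_cost, total_cost,
                           useful_pathkeys, NULL))
        continue;

    add_path(rel, (Path *)
             create_incremental_sort_path(root, rel, subpath,
                                          useful_pathkeys,
                                          presorted_keys, -1.0));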

But I'm not sure the path construction is the expensive part, as it
should be disabled by enable_incrementalsort=off. But the regression
does not seem to disappear, at least not entirely.

One possibility is that it's just one of those regressions due to a change
in binary layout, but I'm not sure how to verify that.

If we are testing with a case that can't actually add more paths (due
to it checking the guc before building them), doesn't that effectively
leave one of these two options:

1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (a single table with no indexes) and I don't quite
see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? That code runs even without incremental sort, but it's
changed from master.

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

Right. I'm just grasping at straws on that.

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing... should probably not have any
real impact... but...

:-(

BTW I see the patch adds pathkeys_common but it's not called from
anywhere. It's probably leftover from an earlier patch version.

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

WFM
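
For context, the semantics in question are roughly these (a sketch, not
the patch code verbatim):

    /*
     * Return true if keys1 are contained in keys2 (i.e. keys2 starts with
     * keys1), and report how many leading pathkeys the two lists share.
     */
    bool
    pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
    {
        int         n = 0;
        ListCell   *key1,
                   *key2;

        forboth(key1, keys1, key2, keys2)
        {
            /* pathkeys are canonical, so pointer comparison suffices */
            if (lfirst(key1) != lfirst(key2))
            {
                *n_common = n;
                return false;
            }
            n++;
        }

        *n_common = n;
        return (key1 == NULL);
    }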

There isn't anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing... I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
we can't get rid of that fix entirely, at the very least. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without the add_partial_path
change and with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort, and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.

That's a good point.
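
For reference, what that conditioning might look like in
add_partial_path() (a simplified, hypothetical sketch along the lines of
the XXX comment in the attached 0001 patch):

    PathCostComparison costcmp;

    if (enable_incrementalsort)
        /* startup cost matters, e.g. for partially sorted paths under LIMIT */
        costcmp = compare_path_costs_fuzzily(new_path, old_path,
                                             STD_FUZZ_FACTOR);
    else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
        costcmp = COSTS_BETTER2;
    else if (old_path->total_cost > new_path->total_cost * STD_FUZZ_FACTOR)
        costcmp = COSTS_BETTER1;
    else
        costcmp = COSTS_EQUAL;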

I've described the idea of doing something like initial_cost_nestloop
above. But I'm a bit skeptical about it, considering that the GUC only
has a limited effect.

A somewhat related note is that the number of indexes has a pretty
significant impact on planning time, even on master. This is the timing of
the same explain script (similar to the one shown before) with different
numbers of indexes on master:

  0 indexes    7 indexes    49 indexes
  ------------------------------------
      10.85        12.56         27.83

The way I look at incremental sort is that it allows using indexes for
queries that couldn't use them before, or that would otherwise require a
separate index. So incremental sort might easily reduce the number of
indexes needed, compensating for the overhead we're discussing here. Of
course, that may or may not be true.

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys list is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't; that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

Well, my point is that create_ordered_paths() looks like this:

is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

if (is_sorted)
{
    ... old code
}
else
{
    if (input_path == cheapest_input_path)
    {
        ... old code
    }

    /* With incremental sort disabled, don't build those paths. */
    if (!enable_incrementalsort)
        continue;

    /* Likewise, if the path can't be used for incremental sort. */
    if (!presorted_keys)
        continue;

    ... incremental sort path
}

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch. Or it's not sorted at all, so presorted_keys==0 and we
never get to the incremental path.

Or did you mean to use the optimization somewhere else?

Hmm, yes, I didn't think through that properly. I'll have to look at
the other cases to confirm the same logic applies there.

One other thing: in the code above we create the regular sort path
inside of `if (input_path == cheapest_input_path)`, but incremental
sort is outside of that condition. I'm not sure I'm remembering why
that was, and it's not obvious to me reading it right now (though it's
getting late here, so maybe I'm just not thinking clearly). Do you
happen to remember why that is?

It's because for the regular sort, the path is either already sorted or
it requires a full sort. But full sort only makes sense on the cheapest
path, because we assume the additional sort cost is independent of the
input cost, essentially

cost(path + Sort) = cost(path) + cost(Sort)

and it's always

cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)

and by checking for cheapest path we simply skip building all the paths
that we'd end up discarding anyway.

With incremental sort we can't do this: the cost of the incremental sort
depends on how well presorted the input path is.
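
To illustrate with invented numbers: if the cheapest path costs 100 and a
full Sort adds 50 (total 150), a path costing 120 that's presorted on a
prefix may only need an incremental sort adding 10 (total 130) - cheaper
overall despite not starting from the cheapest input.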

I've included the optimization on the add_partial_path fix and I now
have numbers (for your test, slightly modified in how I execute it)
like:

branch: 0.8354718927735362
master: 0.8128127066707269

Which is a 2.7% regression (with enable_incrementalsort off).

Can you try a more realistic benchmark, not this focused on the planner
part? Something like a read-only pgbench with a fairly small data set
and a single client, or something like that?
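
Something like this, say (a hypothetical invocation):

    pgbench -i -s 10 test
    pgbench -S -c 1 -T 60 test

with -S for the built-in select-only script, a single client, and a small
scale factor.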

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#276James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#275)
7 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Can you try a more realistic benchmark, not this focused on the planner
part? Something like a read-only pgbench with a fairly small data set
and a single client, or something like that?

A default pgbench run with select-only for 60s got me 99.93% of master's
speed on the branch.

I've attached my current updates (with the optimization in add_partial_path).

To add some weight to the "stuff beyond the patch's control" theory:
I'm pretty sure I've gotten ~1% repeated differences with the included
v49-0004-ignore-single-key-orderings.patch even though it shouldn't
change anything, *both* because enable_incrementalsort is off *and*
because logically it shouldn't be needed (though I still haven't
confirmed that in all cases)... so that's interesting. I'm not suggesting
we include the patch, but wanted you to at least see it.

I can look at some more pgbench stuff tomorrow, but for now I'm
signing off for the night.

James

Attachments:

v49-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v49-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v49 1/7] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, even though a limit applied on top of that
would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v49-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v49-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From e3beea272848409c074cb44551f641b51dcf551b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v49 3/7] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 346 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 580 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..93d967e812 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			pathkeys = lappend(pathkeys, pathkey);
+		}
+
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4d7a68d051..73b7782dcb 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5079,6 +5079,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6433,7 +6494,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6492,6 +6555,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6503,12 +6640,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6539,6 +6682,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6810,6 +7003,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6818,7 +7063,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6853,6 +7100,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6950,10 +7247,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6979,6 +7277,46 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7080,7 +7418,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7234,7 +7572,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 85f5fe37ea..665f4065a4 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v49-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v49-0002-Implement-incremental-sort.patchDownload
From 852b180474e3c1d69b9585beed4f2d5d02d9d92a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v49 2/7] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   86 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4218 insertions(+), 166 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
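
The trade-off described in the perform.sgml paragraph above is easy to see in
miniature. Below is a minimal standalone C sketch (illustrative only, not part
of the patch; the tuple values are borrowed from the worked example in the
nodeIncrementalSort.c header further down) that sorts each presorted-prefix
run separately instead of sorting the whole array at once:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			x;
	int			y;
} Tup;

/* Compare tuples by y only; the x order is already established. */
static int
cmp_y(const void *a, const void *b)
{
	int			ya = ((const Tup *) a)->y;
	int			yb = ((const Tup *) b)->y;

	return (ya > yb) - (ya < yb);
}

int
main(void)
{
	/* Input presorted by x. */
	Tup			in[] = {{1, 5}, {1, 2}, {2, 9}, {2, 1}, {2, 5}, {3, 3}, {3, 7}};
	int			n = sizeof(in) / sizeof(in[0]);
	int			start = 0;

	/* Sort each run of equal x values by y, one batch at a time. */
	for (int i = 1; i <= n; i++)
	{
		if (i == n || in[i].x != in[start].x)
		{
			qsort(&in[start], i - start, sizeof(Tup), cmp_y);
			start = i;
		}
	}

	for (int i = 0; i < n; i++)
		printf("(%d, %d)\n", in[i].x, in[i].y);
	return 0;
}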
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
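+
+/*
+ * In text mode the above yields a single line per node, for example (wrapped
+ * here for readability; the numbers are purely illustrative):
+ *
+ *   Full-sort Groups: 4 Sort Method: quicksort Memory: avg=30kB peak=32kB
+ *     Presorted Groups: 3 Sort Method: quicksort Memory: avg=29kB peak=30kB
+ */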
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information in either the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->ss.ps.instrument != NULL) \
+	{ \
+		if (node->shared_info && node->am_worker) \
+		{ \
+			Assert(IsParallelWorker()); \
+			Assert(ParallelWorkerNumber <= node->shared_info->num_workers); \
+			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+		} else { \
+			instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+		} \
+	}
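+
+/*
+ * For illustration, a call like INSTRUMENT_SORT_GROUP(node, prefixsort)
+ * expands to roughly the following (Asserts omitted):
+ *
+ *	if (node->ss.ps.instrument != NULL)
+ *	{
+ *		if (node->shared_info && node->am_worker)
+ *			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].prefixsortGroupInfo,
+ *								  node->prefixsort_state);
+ *		else
+ *			instrumentSortedGroup(&node->incsort_info.prefixsortGroupInfo,
+ *								  node->prefixsort_state);
+ *	}
+ */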
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change.  Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
+ * ----------------------------------------------------------------
+ */
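+
+/*
+ * For example (illustrative): if the full sort state holds tuples with
+ * prefix values [2, 2, 2, 3, 3] when the switch is triggered, the first
+ * call transfers the three 2s into the prefix sort state and sorts them as
+ * their own batch; a later call (driven by n_fullsort_remaining) transfers
+ * the 3s, after which we continue loading matching tuples from the outer
+ * node.
+ */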
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
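+
+/*
+ * Worked example (illustrative): with no bound we buffer
+ * DEFAULT_MIN_GROUP_SIZE = 32 tuples before prefix key checking begins, and
+ * once a group exceeds DEFAULT_MAX_FULL_SORT_GROUP_SIZE = 64 tuples we
+ * assume we've hit a large group and switch to presorted prefix mode.  With
+ * LIMIT 10 the minimum group size drops to Min(32, 10) = 10, so we never
+ * buffer more unchecked tuples than the bound requires.
+ */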
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
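+
+/*
+ * Summary of the execution_status values used below:
+ *
+ *	INCSORT_LOADFULLSORT   - accumulate tuples, sorting on all sort keys
+ *	INCSORT_READFULLSORT   - drain the full sort state
+ *	INCSORT_LOADPREFIXSORT - accumulate a single large prefix key group,
+ *							 sorting only on the suffix keys
+ *	INCSORT_READPREFIXSORT - drain the prefix sort state
+ *
+ * We begin in INCSORT_LOADFULLSORT and move into the prefix states via
+ * switchToPresortedPrefixMode() when a group appears to be large.
+ */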
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will reamin the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pviot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort)
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * needs to move from the full sort to the presorted prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_BACKWARD or
+	 * EXEC_FLAG_MARK, because we only keep one of many sort batches in the
+	 * current sort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory, if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because setting up the pivot
+	 * comparator state is guarded similarly, doing so might actually cause a
+	 * leak.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
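/*
 * For scale (illustrative only, not part of the patch): with nworkers = 4
 * this reserves offsetof(SharedIncrementalSortInfo, sinfo) plus four
 * IncrementalSortInfo slots, one per worker, under a single shm_toc key.
 */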
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
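/*
 * A sketch of the serialized form this reads back (abbreviated and
 * illustrative; field order follows _outIncrementalSort above):
 *
 *   {INCREMENTALSORT ... :numCols 2 :sortColIdx 1 2 ... :nPresortedCols 1}
 */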
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate average cost of sorting of one group where presorted keys are
+	 * equal.  Incremental sort is sensitive to distribution of tuples to the
+	 * groups, where we're relying on quite rough assumptions.  Thus, we're
+	 * pessimistic about incremental sort performance and increase its average
+	 * group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
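/*
 * Worked example of the composition above (purely illustrative numbers, not
 * part of the patch): input_tuples = 10000 and input_groups = 100 give
 * group_tuples = 100 and, with input_run_cost = 200, group_input_run_cost =
 * 2.  If sorting one (pessimistically 1.5x-sized) group of 150 tuples costs,
 * say, startup = 25 and run = 0.25, then:
 *
 *   startup_cost = 25 + input_startup_cost + 2
 *   run_cost     = 0.25 + (0.25 + 25) * 99 + 2 * 99
 *                  + (cpu_tuple_cost + comparison_cost) * 10000
 *                  + 2.0 * cpu_tuple_cost * 100
 *
 * Only the first group's sort and its fraction of the input land in
 * startup_cost, which is what lets a LIMIT favor incremental sort over a
 * full sort.
 */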
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..71fb790d35 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,74 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists.  This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we ended with a null value, then we've processed the whole list. */
+	*n_common = n;
+	return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int			n;
+
+	(void) pathkeys_common_contained_in(keys1, keys2, &n);
+	return n;
+}
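/*
 * A minimal sketch of the new functions (pk_a, pk_b, and pk_c are
 * hypothetical PathKey pointers; pathkeys are canonical, so pointer equality
 * suffices):
 */
List	   *keys1 = list_make3(pk_a, pk_b, pk_c);
List	   *keys2 = list_make2(pk_a, pk_b);
int			n_common;

/* keys1 is not contained in keys2, but the lists share a two-key prefix */
Assert(!pathkeys_common_contained_in(keys1, keys2, &n_common));
Assert(n_common == 2);

/* keys2 is contained in keys1; the common prefix is again two keys */
Assert(pathkeys_common_contained_in(keys2, keys1, &n_common));
Assert(n_common == 2);
Assert(pathkeys_common(keys1, keys2) == 2);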
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1854,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix of the
+ * pathkeys is potentially useful for improving the performance of the
+ * requested ordering.  Thus we return 0 if no valuable keys are found, or
+ * the number of leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan
+ *	  instead.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..4d7a68d051 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..5e752f64b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * allocation overhead might possibly be lowered.  However, we don't consider
+ * array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
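/*
 * For scale (assuming ALLOCSET_SEPARATE_THRESHOLD = 8192 and a 24-byte
 * SortTuple, both typical of 64-bit builds): 8192 / 24 + 1 = 342, so the
 * Max() above yields an initial array of 1024 entries, roughly 24kB.
 */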
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of the
+	 * state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * A sort evicts data to disk when it fails to fit that data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, due to the
+	 * more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
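/*
 * Example (illustrative numbers): if batch 1 sorts within 4MB of memory and
 * batch 2 spills 2MB to disk, maxSpace ends up as the 2MB on-disk figure,
 * since any disk-based batch outranks a purely in-memory one.
 */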
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows avoiding recreation of tuple sort states (and
+ *	save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input dataset
+ *	 is already sorted on a prefix of those keys.  We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..85f5fe37ea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value must be a distinct bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence the selected plans due to the cost fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

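For anyone who wants to try the feature without running the whole regression
suite, here is a minimal sketch of exercising the new node. It assumes the
patch is applied and enable_incrementalsort = on (the default); the table and
index names are illustrative, not part of the patch.

create table demo (a int, b int);
insert into demo select i % 10, i from generate_series(1, 10000) i;
create index demo_a_idx on demo (a);
analyze demo;
-- The index provides ordering on (a) alone, so under a LIMIT the planner
-- should be able to choose an Incremental Sort with "Presorted Key: a",
-- sorting one group of equal "a" values at a time rather than the whole
-- table at once.
explain (costs off) select * from demo order by a, b limit 10;
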
v49-0004-ignore-single-key-orderings.patch (text/x-patch)
From c0c82add7a9cf58584683abe5d498156424beebd Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 20:32:08 -0400
Subject: [PATCH v49 4/7] ignore single key orderings

---
 src/backend/optimizer/path/allpaths.c |  7 +++++--
 src/backend/optimizer/plan/planner.c  | 14 +++++++-------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 93d967e812..b332e474b8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2842,7 +2842,10 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			Path	   *subpath = (Path *) lfirst(lc2);
 			GatherMergePath *path;
 
-			/* path has no ordering at all, can't use incremental sort */
+			/*
+			 * If the path has no ordering at all, we can use neither an
+			 * incremental sort nor implicit sorting with a gather merge.
+			 */
 			if (subpath->pathkeys == NIL)
 				continue;
 
@@ -2907,7 +2910,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			 * Consider incremental sort, but only when the subpath is already
 			 * partially sorted on a pathkey prefix.
 			 */
-			if (enable_incrementalsort && presorted_keys > 0)
+			if (enable_incrementalsort && presorted_keys > 0 && list_length(useful_pathkeys) > 1)
 			{
 				Path	   *tmp;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 73b7782dcb..6592d0446d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5010,7 +5010,7 @@ create_ordered_paths(PlannerInfo *root,
 				continue;
 
 			/* Likewise, if the path can't be used for incremental sort. */
-			if (!presorted_keys)
+			if (!presorted_keys || list_length(root->sort_pathkeys) == 1)
 				continue;
 
 			/* Also consider incremental sort. */
@@ -5086,7 +5086,7 @@ create_ordered_paths(PlannerInfo *root,
 		 * XXX This is probably duplicate with the paths we already generate
 		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
 		 */
-		if (enable_incrementalsort)
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
 		{
 			ListCell   *lc;
 
@@ -6561,7 +6561,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			 * when the path is not already sorted and when incremental sort
 			 * is enabled.
 			 */
-			if (is_sorted || !enable_incrementalsort)
+			if (is_sorted || !enable_incrementalsort || list_length(root->group_pathkeys) == 1)
 				continue;
 
 			/* Restore the input path (we might have added Sort on top). */
@@ -6688,7 +6688,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 				 * when the path is not already sorted and when incremental
 				 * sort is enabled.
 				 */
-				if (is_sorted || !enable_incrementalsort)
+				if (is_sorted || !enable_incrementalsort || list_length(root->group_pathkeys) == 1)
 					continue;
 
 				/* Restore the input path (we might have added Sort on top). */
@@ -7005,7 +7005,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 		}
 
 		/* Consider incremental sort on all partial paths, if enabled. */
-		if (enable_incrementalsort)
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
 		{
 			foreach(lc, input_rel->pathlist)
 			{
@@ -7106,7 +7106,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 			 * when the path is not already sorted and when incremental sort
 			 * is enabled.
 			 */
-			if (is_sorted || !enable_incrementalsort)
+			if (is_sorted || !enable_incrementalsort || list_length(root->group_pathkeys) == 1)
 				continue;
 
 			/* Restore the input path (we might have added Sort on top). */
@@ -7278,7 +7278,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 		add_path(rel, path);
 	}
 
-	if (!enable_incrementalsort)
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
 		return;
 
 	/* also consider incremental sort on partial paths, if enabled */
-- 
2.17.1

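The new list_length() checks above encode a simple observation: when the
ordering has only one pathkey, a nonzero presorted-key count already implies
the path is fully sorted, so an incremental sort could never help. A quick
illustration against the demo table from the earlier sketch (again
illustrative, not part of the patch):

-- With a single-column ordering there is no proper prefix to presort on:
-- either the input already satisfies ORDER BY a (no sort node is needed
-- at all) or it does not (a full Sort is required), so no Incremental
-- Sort node should appear in this plan.
explain (costs off) select * from demo order by a;
drop table demo;
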
v49-0005-remove-dead-function.patch (text/x-patch)
From 770c9e302850ad18a1b8da18cbdf5d556bac5c11 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 20:32:22 -0400
Subject: [PATCH v49 5/7] remove dead function

---
 src/backend/optimizer/path/pathkeys.c | 14 --------------
 src/include/optimizer/paths.h         |  1 -
 2 files changed, 15 deletions(-)

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71fb790d35..a3f6828436 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -388,20 +388,6 @@ pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
 	return (key1 == NULL);
 }
 
-
-/*
- * pathkeys_common
- *    Returns length of longest common prefix of keys1 and keys2.
- */
-int
-pathkeys_common(List *keys1, List *keys2)
-{
-	int			n;
-
-	(void) pathkeys_common_contained_in(keys1, keys2, &n);
-	return n;
-}
-
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 665f4065a4..c4f77d5137 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -192,7 +192,6 @@ typedef enum
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
 extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
-extern int	pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
-- 
2.17.1

v49-0007-rename-pathkeys_common_contained_in.patch (text/x-patch)
From 16f182bf4bfe071b846a47f157721b91f6fc31bb Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Tue, 31 Mar 2020 21:39:45 -0400
Subject: [PATCH v49 7/7] rename pathkeys_common_contained_in

---
 src/backend/optimizer/path/allpaths.c |  2 +-
 src/backend/optimizer/path/pathkeys.c |  6 +++---
 src/backend/optimizer/plan/planner.c  | 14 +++++++-------
 src/include/optimizer/paths.h         |  2 +-
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b332e474b8..bd786dafd7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2849,7 +2849,7 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			if (subpath->pathkeys == NIL)
 				continue;
 
-			is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
 													 subpath->pathkeys,
 													 &presorted_keys);
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a3f6828436..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -335,12 +335,12 @@ pathkeys_contained_in(List *keys1, List *keys2)
 }
 
 /*
- * pathkeys_common_contained_in
+ * pathkeys_count_contained_in
  *    Same as pathkeys_contained_in, but also sets length of longest
  *    common prefix of keys1 and keys2.
  */
 bool
-pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
 {
 	int			n = 0;
 	ListCell   *key1,
@@ -1856,7 +1856,7 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	(void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
 										&n_common_pathkeys);
 
 	return n_common_pathkeys;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6592d0446d..d6ca896ec3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4972,7 +4972,7 @@ create_ordered_paths(PlannerInfo *root,
 		bool		is_sorted;
 		int			presorted_keys;
 
-		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
 												 input_path->pathkeys, &presorted_keys);
 
 		if (is_sorted)
@@ -5105,7 +5105,7 @@ create_ordered_paths(PlannerInfo *root,
 				 * full sort, which is what happens above).
 				 */
 
-				is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
 														 input_path->pathkeys,
 														 &presorted_keys);
 
@@ -6567,7 +6567,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* Restore the input path (we might have added Sort on top). */
 			path = path_original;
 
-			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
 													 path->pathkeys,
 													 &presorted_keys);
 
@@ -6694,7 +6694,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 				/* Restore the input path (we might have added Sort on top). */
 				path = path_original;
 
-				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
 														 path->pathkeys,
 														 &presorted_keys);
 
@@ -7013,7 +7013,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 				bool		is_sorted;
 				int			presorted_keys;
 
-				is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
 														 path->pathkeys,
 														 &presorted_keys);
 
@@ -7112,7 +7112,7 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* Restore the input path (we might have added Sort on top). */
 			path = path_original;
 
-			is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
 													 path->pathkeys,
 													 &presorted_keys);
 
@@ -7289,7 +7289,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 		int			presorted_keys;
 		double		total_groups;
 
-		is_sorted = pathkeys_common_contained_in(root->group_pathkeys,
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
 												 path->pathkeys,
 												 &presorted_keys);
 
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c4f77d5137..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -191,7 +191,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
-extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
-- 
2.17.1

v49-0006-add-fast-path-to-partial-path-consideration.patchtext/x-patch; charset=US-ASCII; name=v49-0006-add-fast-path-to-partial-path-consideration.patchDownload
From d61827b01b14a1f1fa15ffea20600d539c44ce71 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Wed, 1 Apr 2020 01:01:01 +0000
Subject: [PATCH v49 6/7] add fast path to partial path consideration

---
 src/backend/optimizer/util/pathnode.c | 83 +++++++++++++++++++++------
 1 file changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5e752f64b9..e444aef60a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -779,36 +779,83 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * Unless pathkeys are incompatible, see if one of the paths dominates
 		 * the other (both in startup and total cost). It may happen that one
 		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			PathCostComparison costcmp;
-
 			/*
-			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 * It's not entirely obvious that we only need to consider startup
+			 * cost when incremental sort is enabled. But doing so saves us ~1%
+			 * of planning time in some worst case scenarios. We have to
+			 * consider startup cost though for incremental sort, because that
+			 * planner option uncovers scenarios where a total higher cost query
+			 * planner option uncovers scenarios where a higher total cost
+			 * query plan wins over lower cost ones, because a lower startup
+			 * cost (but higher total cost) path would otherwise be discarded
+			 * in favor of a higher startup cost (but lower total cost) path
+			 * before LIMIT optimizations can be applied.
-			costcmp = compare_path_costs_fuzzily(new_path, old_path,
-												 STD_FUZZ_FACTOR);
-
-			if (costcmp == COSTS_BETTER1)
+			if (enable_incrementalsort)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
-					remove_old = true;
+				PathCostComparison costcmp;
+
+				/*
+				 * Do a fuzzy cost comparison with standard fuzziness limit.
+				 */
+				costcmp = compare_path_costs_fuzzily(new_path, old_path,
+													 STD_FUZZ_FACTOR);
+
+				if (costcmp == COSTS_BETTER1)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+				}
+				else if (costcmp == COSTS_BETTER2)
+				{
+					if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
+				else if (costcmp == COSTS_EQUAL)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+					else if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
 			}
-			else if (costcmp == COSTS_BETTER2)
+			else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER2)
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
 					accept_new = false;
 			}
-			else if (costcmp == COSTS_EQUAL)
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
 					remove_old = true;
-				else if (keyscmp == PATHKEYS_BETTER2)
-					accept_new = false;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
 			}
 		}
 
-- 
2.17.1

#277James Coleman
jtc331@gmail.com
In reply to: James Coleman (#276)
6 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Mar 31, 2020 at 11:07 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys list is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

Well, my point is that create_ordered_paths() looks like this:

is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

if (is_sorted)
{
    ... old code
}
else
{
    if (input_path == cheapest_input_path)
    {
        ... old code
    }

    /* With incremental sort disabled, don't build those paths. */
    if (!enable_incrementalsort)
        continue;

    /* Likewise, if the path can't be used for incremental sort. */
    if (!presorted_keys)
        continue;

    ... incremental sort path
}

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch. Or it's not sorted at all, so presorted_keys==0 and we
never get to the incremental path.

Or did you mean to use the optimization somewhere else?

Hmm, yes, I didn't think through that properly. I'll have to look at
the other cases to confirm the same logic applies there.

I looked through this more carefully, and I did end up finding a few
places where we can skip iterating through a list of paths entirely
with this check, so I added it there. I also cleaned up some comments,
added comments and asserts to the other places where
list_length(pathkeys) should be guaranteed to be > 1, removed a few
asserts I found unnecessary, and merged duplicative
pathkeys_[count_]_contained_in calls.
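
Concretely, the skip condition has roughly this shape (a sketch only;
the actual hunks touch several call sites in planner.c, as in the diff
quoted upthread):

    if (is_sorted || !enable_incrementalsort ||
        list_length(root->group_pathkeys) == 1)
        continue;  /* sorted, disabled, or a single pathkey: no
                    * presorted prefix to exploit */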

One other thing: in the code above we create the regular sort path
inside of `if (input_path == cheapest_input_path)`, but incremental
sort is outside of that condition. I'm not sure I'm remembering why
that was, and it's not obvious to me reading it right now (though it's
getting late here, so maybe I'm just not thinking clearly). Do you
happen to remember why that is?

It's because for the regular sort, the path is either already sorted or
it requires a full sort. But full sort only makes sense on the cheapest
path, because we assume the additional sort cost is independent of the
input cost, essentially

cost(path + Sort) = cost(path) + cost(Sort)

and it's always

cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)

and by checking for cheapest path we simply skip building all the paths
that we'd end up discarding anyway.

With incremental sort we can't do this: the cost of the incremental sort
depends on how well presorted the input path is.
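
Put differently, the loop logic is roughly this (a sketch, with
hypothetical consider_* helpers standing in for the actual
path-building code):

    if (input_path == cheapest_input_path)
    {
        /* Full sort: input-independent cost, so only the cheapest
         * input path can win. */
        consider_full_sort(input_path);
    }

    if (enable_incrementalsort && presorted_keys > 0)
    {
        /* Incremental sort: cost depends on presorted_keys, so any
         * partially sorted path might win - consider them all. */
        consider_incremental_sort(input_path, presorted_keys);
    }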

Thanks for the explanation. I've added a comment to that effect.

James

Attachments:

v50-0004-add-fast-path-to-partial-path-consideration.patchtext/x-patch; charset=US-ASCII; name=v50-0004-add-fast-path-to-partial-path-consideration.patchDownload
From 82c72ef60f9e3879a0c3733f38bf1d61843d0824 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Wed, 1 Apr 2020 01:01:01 +0000
Subject: [PATCH v50 4/6] add fast path to partial path consideration

---
 src/backend/optimizer/util/pathnode.c | 83 +++++++++++++++++++++------
 1 file changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5e752f64b9..e444aef60a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -779,36 +779,83 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * Unless pathkeys are incompatible, see if one of the paths dominates
 		 * the other (both in startup and total cost). It may happen that one
 		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			PathCostComparison costcmp;
-
 			/*
-			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 * It's not entirely obvious that we only need to consider startup
+			 * cost when incremental sort is enabled. But doing so saves us ~1%
+			 * of planning time in some worst case scenarios. We have to
+			 * consider startup cost though for incremental sort, because that
+			 * planner option uncovers scenarios where a higher total cost
+			 * query plan wins over lower cost ones, because a lower startup
+			 * cost (but higher total cost) path would otherwise be discarded
+			 * in favor of a higher startup cost (but lower total cost) path
+			 * before LIMIT optimizations can be applied.
 			 */
-			costcmp = compare_path_costs_fuzzily(new_path, old_path,
-												 STD_FUZZ_FACTOR);
-
-			if (costcmp == COSTS_BETTER1)
+			if (enable_incrementalsort)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
-					remove_old = true;
+				PathCostComparison costcmp;
+
+				/*
+				 * Do a fuzzy cost comparison with standard fuzziness limit.
+				 */
+				costcmp = compare_path_costs_fuzzily(new_path, old_path,
+													 STD_FUZZ_FACTOR);
+
+				if (costcmp == COSTS_BETTER1)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+				}
+				else if (costcmp == COSTS_BETTER2)
+				{
+					if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
+				else if (costcmp == COSTS_EQUAL)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+					else if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
 			}
-			else if (costcmp == COSTS_BETTER2)
+			else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER2)
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
 					accept_new = false;
 			}
-			else if (costcmp == COSTS_EQUAL)
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
 					remove_old = true;
-				else if (keyscmp == PATHKEYS_BETTER2)
-					accept_new = false;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
 			}
 		}
 
-- 
2.17.1

v50-0003-comment.patchtext/x-patch; charset=US-ASCII; name=v50-0003-comment.patchDownload
From 1805c68b0ae14382952c49b77ccbe30cd7605baa Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Wed, 1 Apr 2020 09:04:39 -0400
Subject: [PATCH v50 3/6] comment

---
 src/backend/optimizer/plan/planner.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cc4718d1c9..aeb83841d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4986,6 +4986,11 @@ create_ordered_paths(PlannerInfo *root,
 		}
 		else
 		{
+			/*
+			 * Try adding an explicit sort, but only to the cheapest total path
+			 * since a full sort should generally add the same cost to all
+			 * paths.
+			 */
 			if (input_path == cheapest_input_path)
 			{
 				/*
@@ -5005,7 +5010,13 @@ create_ordered_paths(PlannerInfo *root,
 				add_path(ordered_rel, sorted_path);
 			}
 
-			/* With incremental sort disabled, don't build those paths. */
+			/*
+			 * If incremental sort is enabled, then try it as well. Unlike with
+			 * regular sorts, we can't just look at the cheapest path, because
+			 * the cost of incremental sort depends on how well presorted the
+			 * path is. Additionally, incremental sort may enable a cheaper
+			 * startup path to win out despite higher total cost.
+			 */
 			if (!enable_incrementalsort)
 				continue;
 
-- 
2.17.1

v50-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v50-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v50 1/6] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup path, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v50-0005-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v50-0005-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From be34f747fdf1250a8454503312d27e42cf73f8a2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v50 5/6] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 208 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 346 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 580 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..006924d4a6 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,210 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		List	   *pathkeys = NIL;
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that meets
+			 * the criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			pathkeys = lappend(pathkeys, pathkey);
+		}
+
+		if (pathkeys)
+			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/* path has no ordering at all, can't use incremental sort */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding an explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely a duplicate of
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			Assert(!is_sorted);
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3103,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index aeb83841d7..1cfbd88eec 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5090,6 +5090,67 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 */
+		if (enable_incrementalsort)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6444,7 +6505,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6503,6 +6566,80 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6514,12 +6651,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_contained_in(root->group_pathkeys,
+												  path->pathkeys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6550,6 +6693,56 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* We've already skipped fully sorted paths above. */
+				Assert(!is_sorted);
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6821,6 +7014,58 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/* Consider incremental sort on all partial paths, if enabled. */
+		if (enable_incrementalsort)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6829,7 +7074,9 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
 
 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
 											  path->pathkeys);
@@ -6864,6 +7111,56 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
+			/* We've already skipped fully sorted paths above. */
+			Assert(!is_sorted);
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6961,10 +7258,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6990,6 +7288,47 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	if (!enable_incrementalsort)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		total_groups = path->rows * path->parallel_workers;
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7091,7 +7429,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7245,7 +7583,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ed50092bc7..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v50-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v50-0002-Implement-incremental-sort.patchDownload
From d66d96bdfd9079facb81a1d3e8b836704eab56e0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v50 2/6] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   72 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   74 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    1 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4203 insertions(+), 166 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
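+ *
+ * As an illustrative sketch (numbers invented), the text-format output for
+ * a single node might look like:
+ *
+ *   Full-sort Groups: 4 Sort Method: quicksort Memory: avg=27kB peak=28kB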
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT " Sort Method",
+						 groupLabel, groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) lfirst(methodCell));
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
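+ *
+ *	As a rough sketch, the resulting state machine (tracked in
+ *	execution_status) is:
+ *	  - LOADFULLSORT: on a prefix key change or exhausted input, move to
+ *	    READFULLSORT; after DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples without
+ *	    a prefix key change, transition (via switchToPresortedPrefixMode)
+ *	    to LOADPREFIXSORT or READPREFIXSORT.
+ *	  - LOADPREFIXSORT: on a prefix key change or exhausted input, move to
+ *	    READPREFIXSORT.
+ *	  - READFULLSORT/READPREFIXSORT: once the sorted batch is drained,
+ *	    return to LOADFULLSORT (or transfer the next group out of the full
+ *	    sort state).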
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information either in the local node's sort
+ * info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	do { \
+		if ((node)->ss.ps.instrument != NULL) \
+		{ \
+			if ((node)->shared_info && (node)->am_worker) \
+			{ \
+				Assert(IsParallelWorker()); \
+				Assert(ParallelWorkerNumber < (node)->shared_info->num_workers); \
+				instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+			else \
+			{ \
+				instrumentSortedGroup(&(node)->incsort_info.groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+		} \
+	} while (0)
+
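+/*
+ * For illustration only (not compiled): INSTRUMENT_SORT_GROUP(node, fullsort)
+ * boils down, in the non-parallel case, to
+ *
+ *		instrumentSortedGroup(&node->incsort_info.fullsortGroupInfo,
+ *							  node->fullsort_state);
+ *
+ * while the parallel-worker branch records into this worker's slot in the
+ * shared instrumentation array instead.
+ */
+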
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
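+ *
+ * As an illustrative example: for an int4 key ordered by "<",
+ * get_equality_op_for_ordering_op() yields the matching "=" operator, whose
+ * implementation function (int4eq) we cache so that isCurrentGroup() can
+ * invoke it directly through the pre-built FunctionCallInfo.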
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change.  Therefore we do our comparison
+	 * starting from the last pre-sorted column to detect inequality as early
+	 * as possible and to minimize the number of function calls.
+	 */
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
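+ *
+ * As an illustrative sketch: if the full sort state holds tuples with prefix
+ * values (1), (1), (2), (2), we first transfer the (1) group into the prefix
+ * sort state, sort it, and let the parent read those tuples out; only on a
+ * later call do we transfer the (2) group.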
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for that future prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys).
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
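+
+/*
+ * An illustrative example of how these heuristics interact: with a bound of
+ * 10 (say, LIMIT 10), minGroupSize becomes Min(32, 10) = 10, so we begin
+ * prefix key comparisons after only 10 tuples; without a bound we accumulate
+ * at least 32 tuples per batch, and after more than 64 tuples arrive without
+ * a prefix key change we assume a single large group and switch to presorted
+ * prefix mode.
+ */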
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some prefix
+ *		of target sort columns, performs incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, then we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * needs to move from the full sort to the presorted prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_BACKWARD or
+	 * EXEC_FLAG_MARK, because the current sort state contains only one sort
+	 * batch rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because, unlike a regular sort, we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory, if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and since setting up the pivot
+	 * comparator state is guarded similarly, doing so might actually leak.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of the subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
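+	/*
+	 * When full sort is disabled, add the planner's standard "disable"
+	 * penalty rather than forbidding the path outright.
+	 */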
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
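+ *
+ * As a rough illustration: if 1M input tuples are estimated to fall into
+ * 1000 groups on the presorted keys, we cost a single tuplesort of ~1500
+ * tuples (the per-group size is inflated by half as a safety margin; see
+ * below), charging the first group's sort toward startup cost and the
+ * remaining 999 toward run cost.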
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, for which we rely on quite rough assumptions.
+	 * Thus, we're pessimistic about incremental sort performance and
+	 * inflate its average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we've started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
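+	/* ~two tuple-cost units per group for resetting the tuplesort */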
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,60 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_count_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
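+ *
+ *    For example, with keys1 = (a, b, c) and keys2 = (a, b) the result is
+ *    false with *n_common = 2.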
+ */
+bool
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we ended with a null value, then we've processed the whole list. */
+	*n_common = n;
+	return (key1 == NULL);
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1840,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving performance.
+ * Thus we return either 0, if no useful keys are found, or the number of
+ * leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..cc4718d1c9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,66 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/* With incremental sort disabled, don't build those paths. */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..5e752f64b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2750,6 +2750,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size
+ * so that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that
+ * the allocation overhead might possibly be lowered.  However, we don't
+ * consider array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we setup all of the
+	 * state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state. Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
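+	/*
+	 * Reuse the memtuples array from the previous batch if it is still at
+	 * its initial size; otherwise free it and allocate a fresh one, so that
+	 * one oversized batch doesn't pin extra memory for all later batches.
+	 */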
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * A sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that the amount of space occupied by a set of tuples on disk
+	 * might be less than the amount occupied by the same tuples in memory,
+	 * due to a more compact representation.
+	 */
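+	/*
+	 * Update the maximum when we move from memory to disk, or when we stay
+	 * on the same medium but have used more space than before.
+	 */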
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows avoiding recreation of tuple sort states (and
+ *	save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
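+	/*
+	 * Clear the pointers into the now-reset sort context; they would
+	 * otherwise dangle until the next batch re-initializes them.
+	 */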
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
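+	/* Fold the current sort's usage into the running maximum first. */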
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..ed50092bc7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used in a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v50-0006-cleanup.patchtext/x-patch; charset=US-ASCII; name=v50-0006-cleanup.patchDownload
From 753184e345638b0c681ba6ba1d4b3a2bf7b4c570 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Wed, 1 Apr 2020 08:53:18 -0400
Subject: [PATCH v50 6/6] cleanup

---
 src/backend/optimizer/path/allpaths.c | 13 ++++-
 src/backend/optimizer/plan/planner.c  | 81 +++++++++++++++++----------
 2 files changed, 60 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 006924d4a6..4a72dcf8bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2842,7 +2842,10 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			Path	   *subpath = (Path *) lfirst(lc2);
 			GatherMergePath *path;
 
-			/* path has no ordering at all, can't use incremental sort */
+			/*
+			 * If the path has no ordering at all, we can use neither an
+			 * incremental sort nor implicit sorting with a gather merge.
+			 */
 			if (subpath->pathkeys == NIL)
 				continue;
 
@@ -2867,8 +2870,6 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 				continue;
 			}
 
-			Assert(!is_sorted);
-
 			/*
 			 * Consider regular sort for the cheapest partial path (for each
 			 * useful pathkeys). We know the path is not sorted, because we'd
@@ -2911,6 +2912,12 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 			{
 				Path	   *tmp;
 
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(useful_pathkeys) != 1);
+
 				tmp = (Path *) create_incremental_sort_path(root,
 															rel,
 															subpath,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1cfbd88eec..9608fdaec8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5096,8 +5096,12 @@ create_ordered_paths(PlannerInfo *root,
 		 *
 		 * XXX This is probably duplicate with the paths we already generate
 		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * sort_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
 		 */
-		if (enable_incrementalsort)
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
 		{
 			ListCell   *lc;
 
@@ -6509,8 +6513,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			bool		is_sorted;
 			int			presorted_keys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
 			if (path == cheapest_path || is_sorted)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
@@ -6578,17 +6584,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			/* Restore the input path (we might have added Sort on top). */
 			path = path_original;
 
-			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
-													 path->pathkeys,
-													 &presorted_keys);
-
-			/* We've already skipped fully sorted paths above. */
-			Assert(!is_sorted);
-
 			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
 			path = (Path *) create_incremental_sort_path(root,
 														 grouped_rel,
 														 path,
@@ -6655,8 +6660,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 				bool		is_sorted;
 				int			presorted_keys;
 
-				is_sorted = pathkeys_contained_in(root->group_pathkeys,
-												  path->pathkeys);
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
@@ -6705,17 +6711,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 				/* Restore the input path (we might have added Sort on top). */
 				path = path_original;
 
-				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
-														 path->pathkeys,
-														 &presorted_keys);
-
-				/* We've already skipped fully sorted paths above. */
-				Assert(!is_sorted);
-
 				/* no shared prefix, no point in building incremental sort */
 				if (presorted_keys == 0)
 					continue;
 
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(root->group_pathkeys) != 1);
+
 				path = (Path *) create_incremental_sort_path(root,
 															 grouped_rel,
 															 path,
@@ -7015,8 +7020,14 @@ create_partial_grouping_paths(PlannerInfo *root,
 			}
 		}
 
-		/* Consider incremental sort on all partial paths, if enabled. */
-		if (enable_incrementalsort)
+		/*
+		 * Consider incremental sort on all partial paths, if enabled.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * group_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
 		{
 			foreach(lc, input_rel->pathlist)
 			{
@@ -7078,8 +7089,10 @@ create_partial_grouping_paths(PlannerInfo *root,
 			bool		is_sorted;
 			int			presorted_keys;
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
+
 			if (path == cheapest_partial_path || is_sorted)
 			{
 				/* Sort the cheapest partial path, if it isn't already */
@@ -7123,17 +7136,16 @@ create_partial_grouping_paths(PlannerInfo *root,
 			/* Restore the input path (we might have added Sort on top). */
 			path = path_original;
 
-			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
-													 path->pathkeys,
-													 &presorted_keys);
-
-			/* We've already skipped fully sorted paths above. */
-			Assert(!is_sorted);
-
 			/* no shared prefix, no point in building incremental sort */
 			if (presorted_keys == 0)
 				continue;
 
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
 			path = (Path *) create_incremental_sort_path(root,
 														 partially_grouped_rel,
 														 path,
@@ -7289,7 +7301,14 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 		add_path(rel, path);
 	}
 
-	if (!enable_incrementalsort)
+	/*
+	 * Consider incremental sort on all partial paths, if enabled.
+	 *
+	 * We can also skip the entire loop when we only have a single-item
+	 * group_pathkeys because then we can't possibly have a presorted
+	 * prefix of the list without having the list be fully sorted.
+	 */
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
 		return;
 
 	/* also consider incremental sort on partial paths, if enabled */
-- 
2.17.1

#278Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#277)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 01, 2020 at 09:05:27AM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 11:07 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

One small idea, but I'm not yet sure it helps us a whole lot: if the
query pathkeys list is only length 1, then we could skip the additional
path creation.

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting at. But I didn't see anything
in the code now that explicitly excludes that case when deciding
whether or not to create an incremental sort path, unless I'm missing
something obvious.

Well, my point is that create_ordered_paths() looks like this:

    is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

    if (is_sorted)
    {
        ... old code
    }
    else
    {
        if (input_path == cheapest_input_path)
        {
            ... old code
        }

        /* With incremental sort disabled, don't build those paths. */
        if (!enable_incrementalsort)
            continue;

        /* Likewise, if the path can't be used for incremental sort. */
        if (!presorted_keys)
            continue;

        ... incremental sort path
    }

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch. Or it's not sorted at all, so presorted_keys==0 and we
never get to the incremental path.

Or did you mean to use the optimization somewhere else?

Hmm, yes, I didn't think through that properly. I'll have to look at
the other cases to confirm the same logic applies there.

I looked through this more carefully, and I did end up finding a few
places where we can skip iterating through a list of paths entirely
with this check, so I added it there. I also cleaned up some comments,
added comments and asserts to the other places where
list_length(pathkeys) should be guaranteed to be > 1, removed a few
asserts I found unnecessary, and merged duplicative
pathkeys_[count_]_contained_in calls.

OK

One other thing: in the code above we create the regular sort path
inside of `if (input_path == cheapest_input_path)`, but incremental
sort is outside of that condition. I'm not sure I'm remembering why
that was, and it's not obvious to me reading it right now (though it's
getting late here, so maybe I'm just not thinking clearly). Do you
happen to remember why that is?

It's because for the regular sort, the path is either already sorted or
it requires a full sort. But full sort only makes sense on the cheapest
path, because we assume the additional sort cost is independent of the
input cost, essentially

cost(path + Sort) = cost(path) + cost(Sort)

and it's always

cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)

and by checking for cheapest path we simply skip building all the paths
that we'd end up discarding anyway.

With incremental sort we can't do this, because the cost of the
incremental sort depends on how well presorted the input path is.
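
To spell out the arithmetic with made-up numbers: suppose
cost(cheapest path) = 100, cost(other path) = 110 and cost(Sort) = 50.
For a full sort, only the cheapest input can ever win:

cost(cheapest path) + cost(Sort) = 150 <= cost(other path) + cost(Sort) = 160

But if the other path happens to be presorted on a key prefix, the
incremental sort on top of it may be far cheaper than a full sort, say
cost(IncSort) = 5:

cost(other path) + cost(IncSort) = 115 < 150

so each partially sorted path has to be costed individually rather than
pruned up front.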

Thanks for the explanation. I've added a comment to that effect.

Thanks.

I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
fastpath depends on comparing pointers to the list, but we've been
building new lists (and then returning those), which defeats the
optimization. Attached is a patch that returns the original list in most
cases (and only creates a copy when really necessary). This might also
save a few cycles on building the new list, of course.
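
For reference, the fastpath in question is essentially a pointer-equality
shortcut at the top of the pathkey comparison; a rough sketch (paraphrased,
not the exact code):

    static PathKeysComparison
    compare_pathkeys(List *keys1, List *keys2)
    {
        /*
         * Fall out quickly if we are passed two identical lists. Returning
         * root->query_pathkeys itself (instead of a freshly built copy) is
         * what lets callers hit this branch.
         */
        if (keys1 == keys2)
            return PATHKEYS_EQUAL;

        /* ... otherwise compare the lists element by element ... */
    }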

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be within 1% of
master, in both directions. This very much seems like noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So please, if you think this is not ready or needs more
time for review, let me know. I'm not yet sure if I'll commit this as
a single change, or in three separate commits.

James, can you review the proposed extra fix and merge the fixes into
the main patches?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

run-pgbench.shapplication/x-shDownload
0007-extra-patch.patchtext/plain; charset=us-asciiDownload
From 6da44715c992718782aa634e16df10ee1fe7ecdf Mon Sep 17 00:00:00 2001
From: tt <tt>
Date: Wed, 1 Apr 2020 18:08:52 +0200
Subject: [PATCH 7/7] extra patch

---
 src/backend/optimizer/path/allpaths.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4a72dcf8bc..0534bb24c5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2757,7 +2757,7 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	if (root->query_pathkeys)
 	{
 		ListCell   *lc;
-		List	   *pathkeys = NIL;
+		int		npathkeys = 0;	/* useful pathkeys */
 
 		foreach(lc, root->query_pathkeys)
 		{
@@ -2778,11 +2778,21 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			if (!find_em_expr_for_rel(pathkey_ec, rel))
 				break;
 
-			pathkeys = lappend(pathkeys, pathkey);
+			npathkeys++;
 		}
 
-		if (pathkeys)
-			useful_pathkeys_list = lappend(useful_pathkeys_list, pathkeys);
+		/*
+		 * The whole query_pathkeys list matches, so append it directly, to allow
+		 * comparing pathkeys easily by comparing list pointer. If we have to truncate
+		 * the pathkeys, we gotta do a copy though.
+		 */
+		if (npathkeys == list_length(root->query_pathkeys))
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   root->query_pathkeys);
+		else if (npathkeys > 0)
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   list_truncate(list_copy(root->query_pathkeys),
+														 npathkeys));
 	}
 
 	return useful_pathkeys_list;
-- 
2.24.1

results-is.odsapplication/vnd.oasis.opendocument.spreadsheetDownload
#279James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#278)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
fastpath depends on comparing pointers to the list, but we've been
building new lists (and then returning those), which defeats the
optimization. Attached is a patch that returns the original list in most
cases (and only creates a copy when really necessary). This might also
save a few cycles on building the new list, of course.

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be within 1% of
master, in both directions. This very much seems like noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So please, if you think this is not ready or needs more

I think we need to either implement this or remove the comment:
    * XXX I wonder if we need to consider adding a projection here, as
    * create_ordered_paths does.
in generate_useful_gather_paths().

In the same function we have the following code:
    /*
     * When the partial path is already sorted, we can just add a gather
     * merge on top, and we're done - no point in adding explicit sort.
     *
     * XXX Can't we skip this (maybe only for the cheapest partial path)
     * when the path is already sorted? Then it's likely duplicate with
     * the path created by generate_gather_paths.
     */
    if (is_sorted)
    {
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);

        add_path(rel, &path->path);
        continue;
    }

looking at the relevant loop in generate_gather_paths:
    /*
     * For each useful ordering, we can consider an order-preserving Gather
     * Merge.
     */
    foreach(lc, rel->partial_pathlist)
    {
        Path       *subpath = (Path *) lfirst(lc);
        GatherMergePath *path;

        if (subpath->pathkeys == NIL)
            continue;

        rows = subpath->rows * subpath->parallel_workers;
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);
        add_path(rel, &path->path);
    }

I believe we can eliminate the block entirely in
generate_useful_gather_paths(). Here's my reasoning: all paths for
which is_sorted is true must necessarily have pathkeys, and since we
already add a gather merge for every subpath with pathkeys, we've
already added gather merge paths for all of these.

I've included a patch to change this, but let me know if the reasoning
isn't sound.

We can also remove the XXX on this comment (in the same function):
    * XXX This is not redundant with the gather merge path created in
    * generate_gather_paths, because that merely preserves ordering of
    * the cheapest partial path, while here we add an explicit sort to
    * get match the useful ordering.

because of this code in generate_gather_paths():
    cheapest_partial_path = linitial(rel->partial_pathlist);
    rows =
        cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
    simple_gather_path = (Path *)
        create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
                           NULL, rowsp);
    add_path(rel, simple_gather_path);

but we can clean up the comment a bit: fix the grammar issue in the
last line and fix the reference to gather merge path (it's a gather
path).

I've included that in the same patch.

I also noticed that in create_incremental_sort_path we have this:
    /* XXX comparison_cost shouldn't be 0? */
but I guess that's part of what you're reviewing tomorrow.
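
For what it's worth, if I'm reading the sort costing right, the
comparison_cost argument there is an extra charge added on top of the
default per-comparison cost, so passing 0 may simply mean "nothing
extra". Roughly (paraphrasing costsize.c, not the exact code):

    /* include the default cost-per-comparison */
    comparison_cost += 2.0 * cpu_operator_cost;
    ...
    /* charge comparison_cost for every comparison the sort performs */
    startup_cost += comparison_cost * tuples * LOG2(tuples);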

time for review, let me know. I'm not yet sure if I'll commit this as
a single change, or in three separate commits.

I don't love the idea of committing it as a single patch, but I think
at least the first two probably go together. Otherwise we're
introducing a "fix" with no proven impact that will slow down planning
(even if only in a small way), only to then condition it on a GUC in
the next commit.

But I think you could potentially make an argument for keeping the
additional paths separate...but it's not absolutely necessary IMO.

James, can you review the proposed extra fix and merge the fixes into
the main patches?

I've reviewed it, and it looks correct, so merged into the main series.

Summary:
The attached series includes a couple of XXX fixes or comment cleanup
as noted above. I believe there are two more XXXs that need to be
answered before we merge ("do we need to consider adding a projection"
and "what is the comparison cost for incremental sort").

James

Attachments:

v51-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v51-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 126e3ee4a09439cccbee018034753e7e5012773f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v51 3/4] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 225 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 373 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 620 insertions(+), 40 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..0534bb24c5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,227 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		int		npathkeys = 0;	/* useful pathkeys */
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * the criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			npathkeys++;
+		}
+
+		/*
+		 * The whole query_pathkeys list matches, so append it directly, to allow
+		 * comparing pathkeys easily by comparing list pointer. If we have to truncate
+		 * the pathkeys, we gotta do a copy though.
+		 */
+		if (npathkeys == list_length(root->query_pathkeys))
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   root->query_pathkeys);
+		else if (npathkeys > 0)
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   list_truncate(list_copy(root->query_pathkeys),
+														 npathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 *
+	 * XXX I wonder if we need to consider adding a projection here, as
+	 * create_ordered_paths does.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/*
+			 * If the path has no ordering at all, then we can't use either
+			 * incremental sort or rely on implicit sorting with a gather merge.
+			 */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * When the partial path is already sorted, we can just add a gather
+			 * merge on top, and we're done - no point in adding explicit sort.
+			 *
+			 * XXX Can't we skip this (maybe only for the cheapest partial path)
+			 * when the path is already sorted? Then it's likely duplicate with
+			 * the path created by generate_gather_paths.
+			 */
+			if (is_sorted)
+			{
+				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
+												subpath->pathkeys, NULL, rowsp);
+
+				add_path(rel, &path->path);
+				continue;
+			}
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * XXX This is not redundant with the gather merge path created in
+			 * generate_gather_paths, because that merely preserves ordering of
+			 * the cheapest partial path, while here we add an explicit sort to
+			 * get match the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(useful_pathkeys) != 1);
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3120,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index aeb83841d7..9608fdaec8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5090,6 +5090,71 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * sort_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6444,10 +6509,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_path || is_sorted)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
@@ -6503,6 +6572,79 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6514,12 +6656,19 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6550,6 +6699,55 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(root->group_pathkeys) != 1);
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6821,6 +7019,64 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Consider incremental sort on all partial paths, if enabled.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * group_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6829,10 +7085,14 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_partial_path || is_sorted)
 			{
 				/* Sort the cheapest partial path, if it isn't already */
@@ -6864,6 +7124,55 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6961,10 +7270,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6990,6 +7300,53 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/*
+	 * Consider incremental sort on all partial paths, if enabled.
+	 *
+	 * We can also skip the entire loop when we only have a single-item
+	 * group_pathkeys because then we can't possibly have a presorted
+	 * prefix of the list without having the list be fully sorted.
+	 */
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		/* estimate the total number of rows produced by all workers */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7091,7 +7448,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7245,7 +7602,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ed50092bc7..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v51-0004-cleanup-xxx-in-generate_useful_gather_paths.patchtext/x-patch; charset=US-ASCII; name=v51-0004-cleanup-xxx-in-generate_useful_gather_paths.patchDownload
From da7fce78e6e5e375676423bd5b11eb2cb5994d8f Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Wed, 1 Apr 2020 21:59:41 -0400
Subject: [PATCH v51 4/4] cleanup xxx in generate_useful_gather_paths

---
 src/backend/optimizer/path/allpaths.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0534bb24c5..9b70d091ae 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2864,31 +2864,26 @@ generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_r
 													 &presorted_keys);
 
 			/*
-			 * When the partial path is already sorted, we can just add a gather
-			 * merge on top, and we're done - no point in adding explicit sort.
+			 * We don't need to consider the case where a subpath is already
+			 * fully sorted because generate_gather_paths already creates a
+			 * gather merge path for every subpath that has pathkeys present.
 			 *
-			 * XXX Can't we skip this (maybe only for the cheapest partial path)
-			 * when the path is already sorted? Then it's likely duplicate with
-			 * the path created by generate_gather_paths.
+			 * But since the subpath is already sorted, we know we don't need
+			 * to consider adding a sort (of either kind) on top of it, so
+			 * we can continue here.
 			 */
 			if (is_sorted)
-			{
-				path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
-												subpath->pathkeys, NULL, rowsp);
-
-				add_path(rel, &path->path);
 				continue;
-			}
 
 			/*
 			 * Consider regular sort for the cheapest partial path (for each
 			 * useful pathkeys). We know the path is not sorted, because we'd
 			 * not get here otherwise.
 			 *
-			 * XXX This is not redundant with the gather merge path created in
-			 * generate_gather_paths, because that merely preserves ordering of
-			 * the cheapest partial path, while here we add an explicit sort to
-			 * get match the useful ordering.
+			 * This is not redundant with the gather paths created in
+			 * generate_gather_paths, because that doesn't generate ordered
+			 * output. Here we add an explicit sort to match the useful
+			 * ordering.
 			 */
 			if (cheapest_partial_path == subpath)
 			{
-- 
2.17.1

v51-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v51-0002-Implement-incremental-sort.patchDownload
From b32c2b5565044bc30fcfea672ad8b66909f03df7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v51 2/4] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   72 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   85 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |  134 +-
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    1 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4279 insertions(+), 184 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT " Sort Method",
+						 groupLabel, groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then
+			 * exclude it from the output, since it either didn't launch or
+			 * didn't contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let the input tuples be the following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them all together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
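+ *
+ *	As a concrete example (hypothetical table and index, for illustration
+ *	only): given an index on (a) alone, a query such as
+ *
+ *		SELECT * FROM tbl ORDER BY a, b LIMIT 10;
+ *
+ *	can feed index-scan output, already ordered by a, into an incremental
+ *	sort that completes the ordering by b one prefix key group at a time,
+ *	rather than sorting the entire table.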
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information either in the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->ss.ps.instrument != NULL) \
+	{ \
+		if (node->shared_info && node->am_worker) \
+		{ \
+			Assert(IsParallelWorker()); \
+			Assert(ParallelWorkerNumber < node->shared_info->num_workers); \
+			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+		} else { \
+			instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+		} \
+	}
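+
+/*
+ * For example, INSTRUMENT_SORT_GROUP(node, fullsort) pastes the groupName
+ * token so that stats from node->fullsort_state are recorded into the
+ * corresponding fullsortGroupInfo; prefixsort works analogously.
+ */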
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * Because the input is sorted by keys (0 ... n), the tail keys are more
+	 * likely to change.  Therefore we do our comparison starting from the
+	 * last pre-sorted column to optimize for early detection of inequality
+	 * and minimize the number of function calls.
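+	 *
+	 * For example, with presorted keys (a, b), a run of equal a values may
+	 * still contain many distinct b values, so checking b first usually
+	 * detects a group boundary with a single comparison.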
+	 */
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the tuples already fetched are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
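+ *
+ * For example (illustrative), if the full sort state holds tuples with
+ * prefix values [1, 1, 2, 2, 2], the first call drains the "1" group into
+ * the prefix sort state, and a later call (noticing n_fullsort_remaining
+ * is still positive) comes back for the "2" group.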
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * all of its tuples out, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
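+
+			/*
+			 * For example (illustrative): with an original bound of 100 and
+			 * 40 tuples already sorted, any subsequent batch needs a bound
+			 * of only 60.
+			 */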
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
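+
+/*
+ * For example (illustrative): under LIMIT 5 the initial bound is 5, so we
+ * use Min(DEFAULT_MIN_GROUP_SIZE, 5) = 5 as the minimum group size rather
+ * than fetching 32 tuples we might never need to return.
+ */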
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
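+
+/*
+ * With the defaults above, this means that once we've read more than 64
+ * tuples without seeing a prefix key change, we assume we're inside a
+ * large group and switch to presorted prefix mode.
+ */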
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
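+ *
+ *		A rough sketch of the execution_status transitions:
+ *
+ *		  LOADFULLSORT -(group boundary found)-> READFULLSORT
+ *		  LOADFULLSORT -(large-group heuristic)-> LOADPREFIXSORT
+ *		                                          or READPREFIXSORT
+ *		  LOADPREFIXSORT -(group loaded and sorted)-> READPREFIXSORT
+ *		  READFULLSORT/READPREFIXSORT -(batch drained)-> LOADFULLSORT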
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group, we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort)
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * If we haven't already transitioned modes to reading from the
+			 * full sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that it
+				 * needs to move tuples from the full sort to the presorted
+				 * prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only one of many
+	 * sort batches in the current sort state at any time.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because the pivot comparator
+	 * setup is guarded in the same way, doing so might actually cause a
+	 * leak.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_reset(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_reset(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group in which the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples across groups, and we rely on fairly rough assumptions here.
+	 * Thus, we are pessimistic about incremental sort performance and
+	 * inflate the estimated average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
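
To make the shape of the new cost model concrete, here is a toy,
standalone reproduction of the cost_incremental_sort() arithmetic above
(not part of the patch; it models only the in-memory quicksort branch of
cost_tuplesort(), and the input figures -- a million tuples in a thousand
groups, an input run cost of 10000 -- are invented).  The point to notice
is how little of the total cost lands in startup_cost compared to a full
sort:

#include <stdio.h>
#include <math.h>

#define LOG2(x) (log(x) / 0.693147180559945)

static const double cpu_operator_cost = 0.0025;
static const double cpu_tuple_cost = 0.01;

/* in-memory branch of cost_tuplesort(): N log2 N comparisons + output pass */
static void
toy_cost_tuplesort(double *startup, double *run, double tuples)
{
	double		comparison_cost = 2.0 * cpu_operator_cost;

	if (tuples < 2.0)
		tuples = 2.0;
	*startup = comparison_cost * tuples * LOG2(tuples);
	*run = cpu_operator_cost * tuples;
}

int
main(void)
{
	double		input_tuples = 1000000.0;
	double		input_groups = 1000.0;	/* stand-in for estimate_num_groups */
	double		input_run_cost = 10000.0;	/* invented input path cost */

	double		group_tuples = input_tuples / input_groups;
	double		group_input_run = input_run_cost / input_groups;
	double		gstart, grun, fstart, frun;

	/* pessimistically inflate the estimated average group size by 50% */
	toy_cost_tuplesort(&gstart, &grun, 1.5 * group_tuples);

	/* startup: sort the first group, after consuming its share of input */
	double		startup = gstart + group_input_run;

	/* run: finish the first group, then sort and emit the remaining ones */
	double		run = grun
		+ (grun + gstart) * (input_groups - 1)
		+ group_input_run * (input_groups - 1);

	/* per-tuple group detection plus per-group tuplesort_reset() overhead */
	run += cpu_tuple_cost * input_tuples;
	run += 2.0 * cpu_tuple_cost * input_groups;

	/* a full sort of the same input, for comparison */
	toy_cost_tuplesort(&fstart, &frun, input_tuples);

	printf("incremental: startup %.1f, total %.1f\n", startup, startup + run);
	printf("full sort:   startup %.1f, total %.1f\n",
		   fstart + input_run_cost, fstart + input_run_cost + frun);
	return 0;
}

With these made-up numbers the incremental sort starts producing tuples
after a cost of roughly 90 versus roughly 110000 for the full sort, which
is exactly the property the LIMIT-related planner changes below exploit.
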
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,60 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_count_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we ended with a null value, then we've processed the whole list. */
+	*n_common = n;
+	return (key1 == NULL);
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1840,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because incremental sort is now possible, a path whose pathkeys share
+ * only a leading prefix with the requested ordering is still potentially
+ * useful.  Thus we return the number of leading keys shared by the list
+ * and the requested ordering, or 0 if none are shared.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
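
The semantics of pathkeys_count_contained_in() are easiest to see with a
toy model (not part of the patch; plain int arrays stand in for PathKey
lists, and value equality for the canonical-pathkey pointer equality):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of pathkeys_count_contained_in(): returns true iff keys1 is
 * fully contained in keys2, and always reports the common-prefix length.
 */
static bool
count_contained_in(const int *keys1, int n1, const int *keys2, int n2,
				   int *n_common)
{
	int			n = 0;

	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	*n_common = n;
	return (n == n1);			/* did we consume all of keys1? */
}

int
main(void)
{
	int			query[] = {1, 2, 3};	/* ORDER BY a, b, c */
	int			path[] = {1, 2, 9};		/* path sorted by a, b, d */
	int			n_common;
	bool		contained = count_contained_in(query, 3, path, 3, &n_common);

	/* not contained, but 2 presorted keys: an incremental sort candidate */
	printf("contained=%d n_common=%d\n", contained, n_common);
	return 0;
}
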
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create an incremental sort plan sorted according to the given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..aeb83841d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,77 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			/*
+			 * Try adding an explicit sort, but only to the cheapest total path
+			 * since a full sort should generally add the same cost to all
+			 * paths.
+			 */
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/*
+			 * If incremental sort is enabled, then try it as well. Unlike with
+			 * regular sorts, we can't just look at the cheapest path, because
+			 * path is. Additionally, incremental sort may enable a cheaper
+			 * startup path to win out despite a higher total cost.
+			 * startup path to win out despite higher total cost.
+			 */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
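
To see why the loop above considers incremental sort on more than just
the cheapest input path, pro-rate the costs the way LIMIT costing does.
A toy calculation (not part of the patch; all numbers are invented,
loosely matching the cost-model example earlier):

#include <stdio.h>

/* rough effective cost of fetching `limit` of `rows` tuples from a path */
static double
limited_cost(double startup, double total, double rows, double limit)
{
	return startup + (total - startup) * (limit / rows);
}

int
main(void)
{
	double		rows = 1000000.0;
	double		limit = 100.0;

	/* full sort: the whole input is sorted before the first tuple emerges */
	double		full_sort = limited_cost(110000.0, 112000.0, rows, limit);

	/* incremental sort: only the first group is sorted up front */
	double		inc_sort = limited_cost(90.0, 103000.0, rows, limit);

	/* the higher-total-cost path wins by orders of magnitude under LIMIT */
	printf("full sort: %.1f  incremental: %.1f\n", full_sort, inc_sort);
	return 0;
}

This is also why add_path() must not discard low-startup paths on total
cost alone when incremental sort is enabled, per the pathnode.c hunk
below.
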
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e444aef60a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -779,36 +779,83 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * Unless pathkeys are incompatible, see if one of the paths dominates
 		 * the other (both in startup and total cost). It may happen that one
 		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			PathCostComparison costcmp;
-
 			/*
-			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 * It's not entirely obvious that we need to consider startup
+			 * cost only when incremental sort is enabled, but doing so
+			 * saves us ~1% of planning time in some worst-case scenarios.
+			 * With incremental sort we do have to consider startup cost,
+			 * because otherwise a path with a lower startup cost but a
+			 * higher total cost would be discarded in favor of a path
+			 * with a higher startup cost (but a lower total cost) before
+			 * LIMIT optimizations can be applied, letting a plan with a
+			 * higher total cost win.
 			 */
-			costcmp = compare_path_costs_fuzzily(new_path, old_path,
-												 STD_FUZZ_FACTOR);
-
-			if (costcmp == COSTS_BETTER1)
+			if (enable_incrementalsort)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
-					remove_old = true;
+				PathCostComparison costcmp;
+
+				/*
+				 * Do a fuzzy cost comparison with standard fuzziness limit.
+				 */
+				costcmp = compare_path_costs_fuzzily(new_path, old_path,
+													 STD_FUZZ_FACTOR);
+
+				if (costcmp == COSTS_BETTER1)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+				}
+				else if (costcmp == COSTS_BETTER2)
+				{
+					if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
+				else if (costcmp == COSTS_EQUAL)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+					else if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
 			}
-			else if (costcmp == COSTS_BETTER2)
+			else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER2)
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
 					accept_new = false;
 			}
-			else if (costcmp == COSTS_EQUAL)
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
 					remove_old = true;
-				else if (keyscmp == PATHKEYS_BETTER2)
-					accept_new = false;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
 			}
 		}
 
@@ -2750,6 +2797,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We choose this size so that the
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * allocation overhead might be lowered.  However, we don't consider
+ * array sizes less than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state is initialized, we
+	 * set up all of the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples
+ * with this sort state.  Called both from tuplesort_begin_common (the first
+ * time sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * The sort spills data to disk when it fails to fit the data in main
+	 * memory.  That is why we consider space used on disk more important
+	 * for tracking resource usage than space used in memory.  Note that
+	 * the amount of disk space occupied by a set of tuples might be less
+	 * than the amount of memory occupied by the same tuples, due to the
+	 * more compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Drop all the data in the tuplesort, but keep the
+ *	meta-information.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and saves
+ *	resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
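
The batch-reuse lifecycle that tuplesort_reset() and
tuplesort_begin_batch() enable looks roughly like the toy model below
(not part of the patch; qsort and malloc stand in for tuplesort and the
memory contexts).  The expensive setup survives a reset, while per-batch
state is dropped:

#include <stdio.h>
#include <stdlib.h>

typedef struct ToySort
{
	int		   *tuples;			/* survives reset, like the memtuples array */
	int			capacity;
	int			count;			/* per-batch state */
} ToySort;

static ToySort *
toysort_begin(int capacity)
{
	ToySort    *s = malloc(sizeof(ToySort));

	s->tuples = malloc(capacity * sizeof(int));	/* done once, kept across batches */
	s->capacity = capacity;
	s->count = 0;
	return s;
}

static void
toysort_reset(ToySort *s)
{
	s->count = 0;				/* cheap: keep the allocation, drop the data */
}

static void
toysort_end(ToySort *s)
{
	free(s->tuples);
	free(s);
}

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	ToySort    *s = toysort_begin(8);
	int			batches[2][3] = {{3, 1, 2}, {9, 7, 8}};

	for (int b = 0; b < 2; b++)
	{
		for (int i = 0; i < 3; i++)
			s->tuples[s->count++] = batches[b][i];
		qsort(s->tuples, s->count, sizeof(int), cmp_int);
		for (int i = 0; i < s->count; i++)
			printf("%d ", s->tuples[i]);
		printf("\n");
		toysort_reset(s);		/* ready for the next sort group */
	}
	toysort_end(s);
	return 0;
}

This is the pattern incremental sort relies on: one tuplesort state per
node, reset once per group, rather than one tuplesort per group.
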
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..ed50092bc7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used in a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

v51-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v51-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v51 1/4] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, even though a LIMIT applied on top of
that would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

#280Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#279)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 01, 2020 at 10:09:20PM -0400, James Coleman wrote:

On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
depends on comparing pointers to the list, but we've been building new
lists (and then returned those) which defeats the optimization. Attached
is a patch that returns the original list in most cases (and only
creates a copy when really necessary). This might also save a few cycles
on building the new list, of course.
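
For illustration, here is a minimal standalone sketch of why returning
the original list matters (the List type and function are simplified
stand-ins, not the actual PostgreSQL definitions): the comparison
short-circuits on pointer equality, which can only fire when callers
hand back the very same list object rather than a freshly built copy.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pathkey list. */
typedef struct List
{
    int         length;
    void      **items;
} List;

static bool
pathkeys_equal_sketch(const List *keys1, const List *keys2)
{
    /* fast path: the very same list object is trivially equal */
    if (keys1 == keys2)
        return true;

    if (keys1 == NULL || keys2 == NULL || keys1->length != keys2->length)
        return false;

    /* pathkeys are canonical, so comparing element pointers suffices */
    for (int i = 0; i < keys1->length; i++)
    {
        if (keys1->items[i] != keys2->items[i])
            return false;
    }
    return true;
}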

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be within 1% of
master, in both directions. This very much seems like a noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So please if you think this is not ready or wants more

I think we need to either implement this or remove the comment:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.
in generate_useful_gather_paths().

Yeah. I think we don't need the projection here. My reasoning is that if
we don't need it in generate_gather_paths(), we don't need it here.

In the same function we have the following code:
    /*
     * When the partial path is already sorted, we can just add a gather
     * merge on top, and we're done - no point in adding explicit sort.
     *
     * XXX Can't we skip this (maybe only for the cheapest partial path)
     * when the path is already sorted? Then it's likely duplicate with
     * the path created by generate_gather_paths.
     */
    if (is_sorted)
    {
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);

        add_path(rel, &path->path);
        continue;
    }

looking at the relevant loop in generate_gather_paths:
    /*
     * For each useful ordering, we can consider an order-preserving Gather
     * Merge.
     */
    foreach(lc, rel->partial_pathlist)
    {
        Path       *subpath = (Path *) lfirst(lc);
        GatherMergePath *path;

        if (subpath->pathkeys == NIL)
            continue;

        rows = subpath->rows * subpath->parallel_workers;
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);
        add_path(rel, &path->path);
    }

I believe we can eliminate the block entirely in
generate_useful_gather_paths(). Here's my reasoning: all paths for
which is_sorted is true must necessarily have pathkeys, and since we
already add a gather merge for every subpath with pathkeys, we've
already added gather merge paths for all of these.

I've included a patch to change this, but let me know if the reasoning
isn't sound.
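
To make that concrete, here is a sketch (not the actual patch; names
taken from the quotes above, sort-building details elided) of the shape
the loop is left with once the is_sorted branch is removed -- only
subpaths that still need an explicit sort are handled here:

    foreach(lc, rel->partial_pathlist)
    {
        Path       *subpath = (Path *) lfirst(lc);

        /*
         * Fully sorted subpaths were already covered by the gather
         * merge paths built in generate_gather_paths().
         */
        if (pathkeys_contained_in(useful_pathkeys, subpath->pathkeys))
            continue;

        /* add an explicit (incremental) sort, then gather merge on top */
        subpath = (Path *) create_sort_path(root, rel, subpath,
                                            useful_pathkeys, -1.0);
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        useful_pathkeys, NULL, rowsp);
        add_path(rel, &path->path);
    }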

Good catch! I think you're correct - we don't need to generate this
path, and we can just skip that partial path entirely.

We can also remove the XXX on this comment (in the same function):
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

because of this code in generate_gather_paths():
    cheapest_partial_path = linitial(rel->partial_pathlist);
    rows = cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
    simple_gather_path = (Path *)
        create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
                           NULL, rowsp);
    add_path(rel, simple_gather_path);

but we can clean up the comment a bit: fix the grammar issue in the
last line and fix the reference to gather merge path (it's a gather
path).

I've included that in the same patch.
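
With those two fixes the comment would read roughly like this (a sketch
of the wording):

    /*
     * This is not redundant with the gather path created in
     * generate_gather_paths, because that merely preserves ordering of
     * the cheapest partial path, while here we add an explicit sort to
     * match the useful ordering.
     */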

OK, makes sense.

I also noticed that in create_incremental_sort_path we have this:
/* XXX comparison_cost shouldn't be 0? */
but I guess that's part of what you're reviewing tomorrow.

Right, one of the bits.

time for a review, let me know. I'm not yet sure if I'll commit this as
a single change, or in three separate commits.

I don't love the idea of committing it as a single patch, but I think
at least the first two probably go together. Otherwise we're
introducing a "fix" with no proven impact that will slow down planning
(even if only in a small way), only to then condition it on a GUC in
the next commit.

But I think you could potentially make an argument for keeping the
additional paths separate...but it's not absolutely necessary IMO.

OK. I've been actually wondering whether to move the add_partial_path
after the main patch, for exactly this reason.

James, can you review the proposed extra fix and merge the fixes into
the main patches?

I've reviewed it, and it looks correct, so merged into the main series.

Summary:
The attached series includes a couple of XXX fixes and comment cleanups
as noted above. I believe there are two more XXXs that need to be
answered before we merge ("do we need to consider adding a projection"
and "what is the comparison cost for incremental sort").

Thanks!

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#281James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#280)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 1, 2020 at 10:47 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 01, 2020 at 10:09:20PM -0400, James Coleman wrote:

On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
depends on comparing pointers to the list, but we've been building new
lists (and then returning those), which defeats the optimization. Attached
is a patch that returns the original list in most cases (and only
creates a copy when really necessary). This might also save a few cycles
on building the new list, of course.
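
For reference, the relevant part of that fix returns the original
query_pathkeys when the whole list matches (preserving pointer equality
for the fastpath) and copies only when truncating:

if (npathkeys == list_length(root->query_pathkeys))
    useful_pathkeys_list = lappend(useful_pathkeys_list,
                                   root->query_pathkeys);
else if (npathkeys > 0)
    useful_pathkeys_list = lappend(useful_pathkeys_list,
                                   list_truncate(list_copy(root->query_pathkeys),
                                                 npathkeys));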

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be within 1% of
master, in both directions. This very much seems like noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So please if you think this is not ready or wants more

I think we need to either implement this or remove the comment:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.
in generate_useful_gather_paths().

Yeah. I think we don't need the projection here. My reasoning is that if
we don't need it in generate_gather_paths(), we don't need it here.

All right, then I'm removing the comment in the attached series.

In the same function we have the following code:
/*
* When the partial path is already sorted, we can just add a gather
* merge on top, and we're done - no point in adding explicit sort.
*
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.
*/
if (is_sorted)
{
path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                subpath->pathkeys, NULL, rowsp);
add_path(rel, &path->path);
continue;
}

looking at the relevant loop in generate_gather_paths:
/*
* For each useful ordering, we can consider an order-preserving Gather
* Merge.
*/
foreach(lc, rel->partial_pathlist)
{
Path *subpath = (Path *) lfirst(lc);
GatherMergePath *path;

if (subpath->pathkeys == NIL)
continue;

rows = subpath->rows * subpath->parallel_workers;
path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                subpath->pathkeys, NULL, rowsp);
add_path(rel, &path->path);
}

I believe we can eliminate the block entirely in
generate_useful_gather_paths(). Here's my reasoning: all paths for
which is_sorted is true must necessarily have pathkeys, and since we
already add a gather merge for every subpath with pathkeys, we've
already added gather merge paths for all of these.

I've included a patch to change this, but let me know if the reasoning
isn't sound.

Good catch! I think you're correct - we don't need to generate this
path, and we can just skip that partial path entirely.

The attached patch series merges the above fixes into the main patches.

We can also remove the XXX on this comment (in the same function):
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

because of this code in generate_gather_paths():
cheapest_partial_path = linitial(rel->partial_pathlist);
rows =
cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
simple_gather_path = (Path *)
create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
NULL, rowsp);
add_path(rel, simple_gather_path);

but we can clean up the comment a bit: fix the grammar issue in the
last line and fix the reference to the gather merge path (it's actually
a gather path).

I've included that in the same patch.

OK, makes sense.

I also noticed that in create_incremental_sort_path we have this:
/* XXX comparison_cost shouldn't be 0? */
but I guess that's part of what you're reviewing tomorrow.

Right, one of the bits.

time for a review, let me know. I'm not yet sure if I'll commit this as
a single change, or in three separate commits.

I don't love the idea of committing it as a single patch, but I think
at least the first two probably go together. Otherwise we're
introducing a "fix" with no proven impact that will slow down planning
(even if only in a small way), only to then condition it on a GUC in
the next commit.

But I think you could potentially make an argument for keeping the
additional paths separate...but it's not absolutely necessary IMO.

OK. I've been actually wondering whether to move the add_partial_path
after the main patch, for exactly this reason.

James, can you review the proposed extra fix and merge the fixes into
the main patches?

I've reviewed it, and it looks correct, so merged into the main series.

Summary:
The attached series includes a couple of XXX fixes and comment cleanups
as noted above. I believe there are two more XXXs that need to be
answered before we merge ("do we need to consider adding a projection"
and "what is the comparison cost for incremental sort").

Thanks!

One other thing: the only "real" XXX I see left is in create_ordered_paths():
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

It's not as big a deal as the others (e.g., the projection one could
have been an actual bug), and it's a lot harder for me to determine
whether it's actually duplicative. So we could leave it in, we could
remove it, or perhaps you have some thoughts on how to determine
whether it's true.
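
One observation that may help (from reading the patch, not from
testing): get_useful_pathkeys_for_relation() currently only ever
returns root->query_pathkeys, and for queries with no grouping, window,
or distinct processing query_pathkeys is exactly sort_pathkeys. In that
case the gather-merge-over-incremental-sort paths built by
generate_useful_gather_paths and by this loop in create_ordered_paths
should indeed coincide; when the two pathkey lists differ, they are not
duplicates.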

Thanks!
James

Attachments:

v52-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v52-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From dd3ce2432b52842623ebc39fbe98148df84d2a3f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v52 3/3] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 217 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 373 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 612 insertions(+), 40 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..255f56b827 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,219 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		int		npathkeys = 0;	/* useful pathkeys */
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			npathkeys++;
+		}
+
+		/*
+		 * The whole query_pathkeys list matches, so append it directly, to allow
+		 * comparing pathkeys easily by comparing list pointers. If we have to
+		 * truncate the pathkeys, we have to do a copy, though.
+		 */
+		if (npathkeys == list_length(root->query_pathkeys))
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   root->query_pathkeys);
+		else if (npathkeys > 0)
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   list_truncate(list_copy(root->query_pathkeys),
+														 npathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this looks both at pathkeys of input
+ * paths (aiming to preserve the ordering), but also considers ordering that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/*
+			 * If the path has no ordering at all, then we can't use either
+			 * incremental sort or rely on implicit sorting with a gather merge.
+			 */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * We don't need to consider the case where a subpath is already
+			 * fully sorted because generate_gather_paths already creates a
+			 * gather merge path for every subpath that has pathkeys present.
+			 *
+			 * But since the subpath is already sorted, we know we don't need
+			 * to consider adding a sort (of either kind) on top of it, so
+			 * we can continue here.
+			 */
+			if (is_sorted)
+				continue;
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * This is not redundant with the gather paths created in
+			 * generate_gather_paths, because that doesn't generate ordered
+			 * output. Here we add an explicit sort to match the useful
+			 * ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(useful_pathkeys) != 1);
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3112,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index aeb83841d7..9608fdaec8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5090,6 +5090,71 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * sort_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6444,10 +6509,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_path || is_sorted)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
@@ -6503,6 +6572,79 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6514,12 +6656,19 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6550,6 +6699,55 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(root->group_pathkeys) != 1);
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6821,6 +7019,64 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Consider incremental sort on all partial paths, if enabled.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * group_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6829,10 +7085,14 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_partial_path || is_sorted)
 			{
 				/* Sort the cheapest partial path, if it isn't already */
@@ -6864,6 +7124,55 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6961,10 +7270,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6990,6 +7300,53 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/*
+	 * Consider incremental sort on all partial paths, if enabled.
+	 *
+	 * We can also skip the entire loop when we only have a single-item
+	 * group_pathkeys because then we can't possibly have a presorted
+	 * prefix of the list without having the list be fully sorted.
+	 */
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		/* rows estimate aggregated across workers, as for other paths here */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7091,7 +7448,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7245,7 +7602,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ed50092bc7..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v52-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v52-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v52 1/3] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
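
A hypothetical illustration: for a query ordered by (a, b) with a small
LIMIT, a partial path already sorted on (a) can feed an incremental sort
that only sorts within each group of equal "a" values and so starts
returning tuples almost immediately. Under the old rule that path could
be discarded in favor of an unsorted path with a slightly lower total
cost, even though the latter needs a full sort before the limit.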
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may, however, matter how much the
+ *	  path ordering matches the final ordering needed by upper parts of the
+ *	  plan.  Because that will affect how expensive an incremental sort is,
+ *	  we need to consider both total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v52-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v52-0002-Implement-incremental-sort.patchDownload
From b32c2b5565044bc30fcfea672ad8b66909f03df7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v52 2/3] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  198 ++-
 src/backend/optimizer/path/pathkeys.c         |   72 +-
 src/backend/optimizer/plan/createplan.c       |  143 +-
 src/backend/optimizer/plan/planner.c          |   85 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |  134 +-
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    1 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4279 insertions(+), 184 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32); ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let the input tuples be as follows:
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them together, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
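+ *
+ *	For example (illustrative thresholds; the real ones are the
+ *	DEFAULT_MIN_GROUP_SIZE and DEFAULT_MAX_FULL_SORT_GROUP_SIZE constants
+ *	defined below): suppose the input begins with a long run of tuples
+ *	sharing one prefix value.  We first accumulate tuples into the full
+ *	sort, sorting on all keys.  Once we've read twice the minimum group
+ *	size without seeing the prefix change, we assume the group is large,
+ *	transfer the accumulated tuples into the prefix sort, and from then on
+ *	sort only the suffix keys until the prefix changes.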
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information in either the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
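+ *
+ * Typical invocation (the do/while wrapper makes the trailing semicolon
+ * required): INSTRUMENT_SORT_GROUP(node, fullsort);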
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	do { \
+		if ((node)->ss.ps.instrument != NULL) \
+		{ \
+			if ((node)->shared_info && (node)->am_worker) \
+			{ \
+				Assert(IsParallelWorker()); \
+				Assert(ParallelWorkerNumber < (node)->shared_info->num_workers); \
+				instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+			else \
+			{ \
+				instrumentSortedGroup(&(node)->incsort_info.groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+		} \
+	} while (0)
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * The fact that the input is sorted by keys (0, ... n) implies that the
+	 * tail keys are more likely to change. Therefore we do our comparison
+	 * starting from the last pre-sorted column to optimize for early
+	 * detection of inequality and minimize the number of function calls.
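+	 *
+	 * For example, if the input is sorted by presorted keys (a, b), then
+	 * within a run of equal "a" values the "b" column changes first, so
+	 * checking "b" first typically detects a group boundary with a single
+	 * comparison.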
+	 */
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the tuples we've already fetched are part of a
+ * single prefix key group, we also have to handle the possibility that there
+ * is at least one different prefix key group before the large one.
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
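+	 *
+	 * For example, with LIMIT 100 and 40 tuples already returned by
+	 * previous groups (bound_Done = 40), this tuplesort needs to retain
+	 * only its top 60 tuples.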
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the parent node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
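+ *
+ * For example, with LIMIT 5 we start watching for a prefix key change after
+ * 5 tuples instead of 32, avoiding fetching tuples beyond what the bound
+ * can use.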
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining > 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * If bounded, calculate the number of tuples remaining and configure
+		 * both bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * If we haven't already transitioned modes to reading from the
+			 * full sort state, then we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition into presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sort "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
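+				 *
+				 * For example, with currentBound = 10 after 70 tuples have
+				 * been read, a top-n heap sort retains only the 10
+				 * lowest-sorting tuples, so we continue as if nTuples were 10.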
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that
+				 * it has tuples to move from the full sort to the presorted
+				 * prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we keep only one of many
+	 * sort batches in the current sort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory, if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost.  Moreover, since the pivot comparator
+	 * state set up by preparePresortedCols() is guarded by the same
+	 * "fullsort_state == NULL" check, nulling the pointers here would leak
+	 * both the reset tuplesorts and the previously prepared keys.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..8a52271692 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,163 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ * 	Determines and returns the cost of sorting a relation, including the
+ *	cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+			   Cost input_total_cost, double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
+{
+	cost_tuplesort(startup_cost, run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		*startup_cost += disable_cost;
+
+	*startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
 
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples whose
+	 * presorted keys are equal.  Incremental sort is sensitive to the
+	 * distribution of tuples across groups, where we rely on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and increase its average group size by half.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
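+
+	/*
+	 * For illustration (hypothetical numbers): with input_tuples = 10000 and
+	 * input_groups = 100, each group is costed as a tuplesort of ~150 tuples
+	 * (the 1.5 pessimism factor above).  Startup pays for sorting the first
+	 * group and reading its share of the input; emitting its tuples and
+	 * processing the other 99 groups all lands in run_cost.
+	 */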
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   input_cost,
+				   tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,60 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_count_contained_in
+ *    Same as pathkeys_contained_in, but also sets the length of the longest
+ *    common prefix of keys1 and keys2.
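+ *
+ *    For example (hypothetical pathkeys): keys1 = (a, b, c) vs. keys2 = (a, b)
+ *    yields false with *n_common = 2, while keys1 = (a, b) vs.
+ *    keys2 = (a, b, c) yields true with *n_common = 2.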
+ */
+bool
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we ran off the end of keys1, then keys1 is contained in keys2. */
+	*n_common = n;
+	return (key1 == NULL);
+}
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1840,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because of the possibility of incremental sort, a prefix of the requested
+ * ordering is now potentially useful.  For example, a path sorted by (a) is
+ * useful for ORDER BY a, b.  Thus we return the number of leading keys
+ * shared by the list and the requested ordering, or 0 if there are none.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..5be9135646 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5088,17 +5127,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
 	Plan	   *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
-
-	cost_sort(&sort_path, root, NIL,
-			  lefttree->total_cost,
-			  lefttree->plan_rows,
-			  lefttree->plan_width,
-			  0.0,
-			  work_mem,
-			  limit_tuples);
-	plan->plan.startup_cost = sort_path.startup_cost;
-	plan->plan.total_cost = sort_path.total_cost;
+	Cost		startup_cost,
+				run_cost;
+
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
+	cost_full_sort(&startup_cost, &run_cost,
+				   lefttree->total_cost,
+				   lefttree->plan_rows,
+				   lefttree->plan_width,
+				   0.0,
+				   work_mem,
+				   limit_tuples);
+	plan->plan.startup_cost = startup_cost;
+	plan->plan.total_cost = startup_cost + run_cost;
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5677,9 +5723,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5742,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6119,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6890,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..aeb83841d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,77 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			/*
+			 * Try adding an explicit sort, but only to the cheapest total path
+			 * since a full sort should generally add the same cost to all
+			 * paths.
+			 */
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/*
+			 * If incremental sort is enabled, then try it as well. Unlike with
+			 * regular sorts, we can't just look at the cheapest path, because
+			 * the cost of incremental sort depends on how well presorted the
+			 * path is.  Additionally, incremental sort may enable a cheaper
+			 * startup path to win out despite higher total cost.
+			 */
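+			/*
+			 * For example (hypothetical schema): given an index on (a) and
+			 * "SELECT * FROM t ORDER BY a, b LIMIT 10", the index path is
+			 * presorted on (a), so an incremental sort need only sort each
+			 * group of equal "a" values on "b" -- often much cheaper to
+			 * start up than a full sort of the cheapest unordered path.
+			 */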
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e444aef60a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -779,36 +779,83 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * Unless pathkeys are incompatible, see if one of the paths dominates
 		 * the other (both in startup and total cost). It may happen that one
 		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			PathCostComparison costcmp;
-
 			/*
-			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 * When incremental sort is enabled we must compare startup cost
+			 * as well as total cost; otherwise, comparing total cost alone
+			 * saves us ~1% of planning time in some worst-case scenarios.
+			 * Startup cost matters for incremental sort because a path with
+			 * a lower startup but higher total cost may win once LIMIT
+			 * optimizations are applied, so discarding it here based on
+			 * total cost alone could lose the better plan.
 			 */
-			costcmp = compare_path_costs_fuzzily(new_path, old_path,
-												 STD_FUZZ_FACTOR);
-
-			if (costcmp == COSTS_BETTER1)
+			if (enable_incrementalsort)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
-					remove_old = true;
+				PathCostComparison costcmp;
+
+				/*
+				 * Do a fuzzy cost comparison with standard fuzziness limit.
+				 */
+				costcmp = compare_path_costs_fuzzily(new_path, old_path,
+													 STD_FUZZ_FACTOR);
+
+				if (costcmp == COSTS_BETTER1)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+				}
+				else if (costcmp == COSTS_BETTER2)
+				{
+					if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
+				else if (costcmp == COSTS_EQUAL)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+					else if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
 			}
-			else if (costcmp == COSTS_BETTER2)
+			else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER2)
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
 					accept_new = false;
 			}
-			else if (costcmp == COSTS_EQUAL)
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
 					remove_old = true;
-				else if (keyscmp == PATHKEYS_BETTER2)
-					accept_new = false;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
 			}
 		}
 
@@ -2750,6 +2797,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is kept low.  However, we don't consider array sizes less than
+ * 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among the
+								 * sorts of groups, in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for caller tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the maincontext.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state is initialized, we
+	 * set up all of the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent uses).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
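+
+	/*
+	 * A previous batch may have grown (or shrunk, after merging) the
+	 * memtuples array; if so, release it here so the new batch restarts at
+	 * the initial array size.
+	 */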
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+{
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data in main
+	 * memory.  This is why we treat space used on disk as more important
+	 * for tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by a set of tuples on disk might be less
+	 * than the amount occupied by the same tuples in memory, due to the
+	 * more compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This avoids recreating tuplesort states (and thereby
+ *	saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, re-setup all of the state common
+	 * to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
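+
+/*
+ * Sketch of the intended usage pattern for sorting multiple batches (the
+ * calling executor code is not part of this hunk):
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch:
+ *			tuplesort_puttupleslot(state, slot);	-- once per tuple
+ *			tuplesort_performsort(state);
+ *			... read back via tuplesort_gettupleslot ...
+ *			tuplesort_reset(state);
+ *		tuplesort_end(state);
+ */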
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input dataset is
+ *	 already sorted on a leading prefix of those keys.  We call these
+ *	 "presorted keys".  PresortedKeyData represents information about one
+ *	 such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
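+/*
+ * The executor (nodeIncrementalSort.c) maintains two tuplesort states: a
+ * "full sort" ordering tuples on all sort keys, and a "prefix sort" which,
+ * roughly, orders tuples sharing an equal presorted prefix on the remaining
+ * keys only (see fullsort_state and prefixsort_state below).  Each is
+ * instrumented separately.
+ */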
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
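+
+/*
+ * The LOAD* states mean we are currently feeding tuples into the full-sort
+ * or prefix-sort tuplesort state; the READ* states mean we are returning
+ * sorted tuples from the corresponding state.
+ */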
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..5725b4828e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+						   Cost input_total_cost, double tuples, int width,
+						   Cost comparison_cost, int sort_mem,
+						   double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..ed50092bc7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used in a bitmask in Incremental Sort's shared memory
+ * instrumentation so needs to have each value be a separate bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#282Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#281)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

Thanks, the v52 looks almost ready. I've been looking at the two or
three things I mentioned, and I have a couple of comments.

1) /* XXX comparison_cost shouldn't be 0? */

I'm not worried about this, because this is not really introduced by
this patch - create_sort_path has the same comment/issue, so I think
it's acceptable to do the same thing for incremental sort.

2) INITIAL_MEMTUPSIZE

tuplesort.c does this:

#define INITIAL_MEMTUPSIZE Max(1024, \
ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
But I think it fails to do that, for a couple of reasons.

Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
minimum (without padding), so with 1024 elements it's guaranteed to be
at least 21kB - so exceeding the threshold. The maximum value is
something like 256.

Secondly, the second part of the formula is guaranteed to get us over
the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then

ALLOCSET_SEPARATE_THRESHOLD / 30 = 273

but we end up using 274, resulting in 8220B array. :-(

So I guess the formula should be more like

Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))

or something like that.

FWIW I think the whole hypothesis that selecting the array size below
ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
allocate this only once (or very few times), and if we need to grow the
array we'll hit the threshold anyway.

I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
may be OK too.
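
To make the arithmetic concrete, here's a throwaway standalone program.
The 30-byte SortTuple size is just an assumed figure for illustration;
the real sizeof(SortTuple) depends on platform padding:

#include <stdio.h>

#define ALLOCSET_SEPARATE_THRESHOLD 8192	/* value quoted above */
#define SORTTUPLE_SIZE 30					/* assumed, for illustration */

int
main(void)
{
	/* second arm of the current INITIAL_MEMTUPSIZE formula */
	int			n = ALLOCSET_SEPARATE_THRESHOLD / SORTTUPLE_SIZE + 1;

	printf("elements = %d, array = %d bytes (threshold = %d)\n",
		   n, n * SORTTUPLE_SIZE, ALLOCSET_SEPARATE_THRESHOLD);
	/* prints: elements = 274, array = 8220 bytes (threshold = 8192) */

	/* dropping the +1 stays just under the threshold */
	n = ALLOCSET_SEPARATE_THRESHOLD / SORTTUPLE_SIZE;
	printf("without +1: %d bytes\n", n * SORTTUPLE_SIZE);
	/* prints: without +1: 8190 bytes */

	return 0;
}

So the +1 alone pushes the allocation over the limit, independently of
the Max(1024, ...) problem.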

3) add_partial_path

I'm a bit torn about using enable_incrementalsort in add_partial_path. I
know we've done that to eliminate the overhead, but it just seems weird.
I wonder if that really saves us anything or if it was just noise. I'll
do more testing before making a decision.

While looking at add_path, I see the comparisons there are made in the
opposite order - we first compare costs, and only then pathkeys. That
seems like a more efficient way, I guess (cheaper float comparison
first, pathkey comparison with a loop second). I wonder why it's not
done like that in add_partial_path too? Perhaps this will make it cheap
enough to remove the enable_incrementalsort check.
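
For reference, the ordering in add_path I have in mind looks roughly
like this (a sketch from memory, not the verbatim source):

	/* fuzzy cost comparison first: a couple of float compares, no loop */
	costcmp = compare_path_costs_fuzzily(new_path, old_path,
										 STD_FUZZ_FACTOR);

	if (costcmp != COSTS_DIFFERENT)
	{
		/* only now pay for walking the pathkey lists */
		keyscmp = compare_pathkeys(new_path->pathkeys,
								   old_path->pathkeys);
		...
	}

That is, the pathkey loop runs only when the costs alone are
inconclusive.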

4) add_partial_path_precheck

While looking at add_partial_path, I realized add_partial_path_precheck
probably needs the same change - to consider startup cost too. I think
the expectation is that add_partial_path_precheck only rejects paths
that we know would then get rejected by add_partial_path.

But now the behavior is inconsistent, which might lead to surprising
behavior (I haven't seen such cases, but I haven't looked for them).
At the moment the add_partial_path_precheck is called only from
joinpath.c, maybe it's not an issue there.

If it's OK to keep it like this, it'd probably deserve a comment why
the difference is OK. And the comments also contain the same claim that
parallel plans only need to look at total cost.
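
A minimal sketch of the change I have in mind, with the caveat that the
startup_cost parameter is assumed here - the function currently receives
only the total cost:

	/* inside add_partial_path_precheck's loop over partial_pathlist */
	keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
	if (keyscmp != PATHKEYS_DIFFERENT)
	{
		/*
		 * Reject the new path only if the old one dominates it on both
		 * cost dimensions, mirroring the add_partial_path change.
		 */
		if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
			startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR &&
			keyscmp != PATHKEYS_BETTER1)
			return false;
	}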

5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.
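
For context, the dummy-path idiom in label_sort_with_costsize is
roughly this (quoting from memory, so treat it as a sketch):

static void
label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
	Plan	   *lefttree = plan->plan.lefttree;
	Path		sort_path;		/* dummy for result of cost_sort */

	cost_sort(&sort_path, root, NIL,
			  lefttree->total_cost,
			  lefttree->plan_rows,
			  lefttree->plan_width,
			  0.0,
			  work_mem,
			  limit_tuples);
	plan->plan.startup_cost = sort_path.startup_cost;
	plan->plan.total_cost = sort_path.total_cost;
	plan->plan.plan_rows = lefttree->plan_rows;
	plan->plan.plan_width = lefttree->plan_width;
	plan->plan.parallel_aware = false;
	plan->plan.parallel_safe = lefttree->parallel_safe;
}

cost_sort just fills in the throwaway Path, so there's no need for a
second exported entry point merely to label an already-built Sort node.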

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#283James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#282)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Thanks, the v52 looks almost ready. I've been looking at the two or
three things I mentioned, and I have a couple of comments.

1) /* XXX comparison_cost shouldn't be 0? */

I'm not worried about this, because this is not really introduced by
this patch - create_sort_path has the same comment/issue, so I think
it's acceptable to do the same thing for incremental sort.

Sounds good.

2) INITIAL_MEMTUPSIZE

tuplesort.c does this:

#define INITIAL_MEMTUPSIZE Max(1024, \
ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
But I think it fails to do that, for a couple of reasons.

Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
minimum (without padding), so with 1024 elements it's guaranteed to be
at least 21kB - so exceeding the threshold. The maximum value is
something like 256.

Secondly, the second part of the formula is guaranteed to get us over
the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then

ALLOCSET_SEPARATE_THRESHOLD / 30 = 273

but we end up using 274, resulting in 8220B array. :-(

So I guess the formula should be more like

Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))

or something like that.

FWIW I think the whole hypothesis that selecting the array size below
ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
allocate this only once (or very few times), and if we need to grow the
array we'll hit the threshold anyway.

I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
may be OK too.

That was a part of the patch I haven't touched since I inherited it,
and I didn't feel like I knew enough about the Postgres memory
management to make a determination on whether the reasoning made
sense.

So I'm happy to use a constant as suggested.

3) add_partial_path

I'm a bit torn about using enable_incrementalsort in add_partial_path. I
know we've done that to eliminate the overhead, but it just seems weird.
I wonder if that really saves us anything or if it was just noise. I'll
do more testing before making a decision.

While looking at add_path, I see the comparisons there are made in the
opposite order - we first compare costs, and only then pathkeys. That
seems like a more efficient way, I guess (cheaper float comparison
first, pathkey comparison with a loop second). I wonder why it's not
done like that in add_partial_path too? Perhaps this will make it cheap
enough to remove the enable_incrementalsort check.

I would love to avoid checking enable_incrementalsort there. I think
it's pretty gross to do so. But if it's the only way to keep the patch
acceptable, then obviously I'd leave it in. So I'm hopeful we can
avoid needing it.

4) add_partial_path_precheck

While looking at add_partial_path, I realized add_partial_path_precheck
probably needs the same change - to consider startup cost too. I think
the expectation is that add_partial_path_precheck only rejects paths
that we know would then get rejected by add_partial_path.

But now the behavior is inconsistent, which might lead to surprising
behavior (I haven't seen such cases, but I haven't looked for them).
At the moment the add_partial_path_precheck is called only from
joinpath.c, maybe it's not an issue there.

If it's OK to keep it like this, it'd probably deserve a comment why
the difference is OK. And the comments also contain the same claim that
parallel plans only need to look at total cost.

I remember some discussion about that much earlier in this thread (or
in the spin-off thread [1] re: that specific change that didn't get a
lot of action), and I'm pretty sure the conclusion was that we should
change both. But I guess we never got around to doing it.

5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.

This another area of the patch I haven't really modified.

James

[1]: /messages/by-id/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg@mail.gmail.com

#284James Coleman
jtc331@gmail.com
In reply to: James Coleman (#283)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Hi,

Thanks, the v52 looks almost ready. I've been looking at the two or
three things I mentioned, and I have a couple of comments.

1) /* XXX comparison_cost shouldn't be 0? */

I'm not worried about this, because this is not really introduced by
this patch - create_sort_path has the same comment/issue, so I think
it's acceptable to do the same thing for incremental sort.

Sounds good.

2) INITIAL_MEMTUPSIZE

tuplesort.c does this:

#define INITIAL_MEMTUPSIZE Max(1024, \
ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
But I think it fails to do that, for a couple of reasons.

Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
minimum (without padding), so with 1024 elements it's guaranteed to be
at least 21kB - so exceeding the threshold. The maximum value is
something like 256.

Secondly, the second part of the formula is guaranteed to get us over
the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then

ALLOCSET_SEPARATE_THRESHOLD / 30 = 273

but we end up using 274, resulting in 8220B array. :-(

So I guess the formula should be more like

Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))

or something like that.

FWIW I think the whole hypothesis that selecting the array size below
ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
allocate this only once (or very few times), and if we need to grow the
array we'll hit the threshold anyway.

I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
may be OK too.

That was a part of the patch I haven't touched since I inherited it,
and I didn't feel like I knew enough about the Postgres memory
management to make a determination on whether the reasoning made
sense.

So I'm happy to use a constant as suggested.

I take that back. That code hasn't changed, it's just moved. Here's
the current code in tuplesort_begin_common on master:

/*
* Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
* see comments in grow_memtuples().
*/
state->memtupsize = Max(1024,
ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);

I'm not sure we ought to change that in this patch...

James

#285James Coleman
jtc331@gmail.com
In reply to: James Coleman (#283)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.

This another area of the patch I haven't really modified.

See attached for a cleanup of this; it removed cost_fullsort so
label_sort_with_costsize is back to how it was.

I've directly merged this into the patch series; if you'd like to see
the diff I can send that along.

James

Attachments:

v53-0001-Consider-low-startup-cost-when-adding-partial-pa.patchtext/x-patch; charset=US-ASCII; name=v53-0001-Consider-low-startup-cost-when-adding-partial-pa.patchDownload
From 187024ae1f0c3888de4cdf3d4628c099a929d66b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH v53 1/3] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan ends up being
chosen because a low startup cost partial path is discarded in favor of
a lower total cost partial path, even when a limit applied on top of
that would normally favor the lower startup cost plan.
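
To make that concrete, here's a toy sketch (standalone C, not planner
code; all names and numbers are made up) of how prorating cost under a
LIMIT can favor the path with the lower startup cost:

#include <stdio.h>

/*
 * Approximate cost of fetching k of N rows from a path: startup cost
 * plus a prorated share of the run cost.
 */
static double
limited_cost(double startup, double total, double N, double k)
{
	return startup + (k / N) * (total - startup);
}

int
main(void)
{
	/* path A: lower total cost; path B: lower startup cost */
	double		N = 100000.0, k = 100.0;

	printf("A: %.2f\n", limited_cost(1000.0, 2000.0, N, k));	/* ~1001.00 */
	printf("B: %.2f\n", limited_cost(10.0, 2500.0, N, k));		/* ~12.49 */
	return 0;
}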
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup cost, in addition to
+ *	  pathkeys.
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.17.1

v53-0003-Consider-incremental-sort-paths-in-additional-pl.patchtext/x-patch; charset=US-ASCII; name=v53-0003-Consider-incremental-sort-paths-in-additional-pl.patchDownload
From 4796bde59da44daf484995300161f773c6d18507 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH v53 3/3] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 217 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 373 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 612 insertions(+), 40 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..255f56b827 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,219 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do incremental sort on top of an index scan under a gather
+ * merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		int		npathkeys = 0;	/* useful pathkeys */
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			npathkeys++;
+		}
+
+		/*
+		 * The whole query_pathkeys list matches, so append it directly, to allow
+		 * comparing pathkeys easily by comparing list pointers. If we have to
+		 * truncate the pathkeys, we have to make a copy, though.
+		 */
+		if (npathkeys == list_length(root->query_pathkeys))
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   root->query_pathkeys);
+		else if (npathkeys > 0)
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   list_truncate(list_copy(root->query_pathkeys),
+														 npathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * the input paths (aiming to preserve the ordering), but also considers orderings that
+ * might be useful for nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/*
+			 * If the path has no ordering at all, then we can neither use
+			 * incremental sort nor rely on implicit sorting with a gather merge.
+			 */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * We don't need to consider the case where a subpath is already
+			 * fully sorted because generate_gather_paths already creates a
+			 * gather merge path for every subpath that has pathkeys present.
+			 *
+			 * But since the subpath is already sorted, we know we don't need
+			 * to consider adding a sort (of either kind) on top of it, so
+			 * we can continue here.
+			 */
+			if (is_sorted)
+				continue;
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * This is not redundant with the gather paths created in
+			 * generate_gather_paths, because that doesn't generate ordered
+			 * output. Here we add an explicit sort to match the useful
+			 * ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(useful_pathkeys) != 1);
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3112,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars, come from
+ * the indicated relation.
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index aeb83841d7..9608fdaec8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5090,6 +5090,71 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This is probably duplicate with the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * sort_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * We don't care if this is the cheapest partial path - we can't
+				 * simply skip it, because it may be partially sorted in which
+				 * case we want to consider adding incremental sort (instead of
+				 * full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6444,10 +6509,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_path || is_sorted)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
@@ -6503,6 +6572,79 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6514,12 +6656,19 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6550,6 +6699,55 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(root->group_pathkeys) != 1);
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6821,6 +7019,64 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Consider incremental sort on all partial paths, if enabled.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * group_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6829,10 +7085,14 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_partial_path || is_sorted)
 			{
 				/* Sort the cheapest partial path, if it isn't already */
@@ -6864,6 +7124,55 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6961,10 +7270,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6990,6 +7300,53 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/*
+	 * Consider incremental sort on all partial paths, if enabled.
+	 *
+	 * We can also skip the entire loop when we only have a single-item
+	 * group_pathkeys because then we can't possibly have a presorted
+	 * prefix of the list without having the list be fully sorted.
+	 */
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7091,7 +7448,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7245,7 +7602,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ed50092bc7..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.17.1

v53-0002-Implement-incremental-sort.patchtext/x-patch; charset=US-ASCII; name=v53-0002-Implement-incremental-sort.patchDownload
From 7d6f6c88092162b65714e82564839c328b822a67 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH v53 2/3] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
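
As a toy illustration of the basic idea (standalone C, not the executor
code), given rows presorted by x we only need to sort each run of equal
x values on y:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } Row;

static int
cmp_y(const void *a, const void *b)
{
	const Row  *ra = a;
	const Row  *rb = b;

	return (ra->y > rb->y) - (ra->y < rb->y);
}

int
main(void)
{
	Row			rows[] = {{1, 5}, {1, 2}, {2, 9}, {2, 1}, {2, 5}, {3, 3}, {3, 7}};
	int			n = sizeof(rows) / sizeof(rows[0]);
	int			start = 0;

	/* sort each group of equal x values by y, one group at a time */
	for (int i = 1; i <= n; i++)
	{
		if (i == n || rows[i].x != rows[start].x)
		{
			qsort(&rows[start], i - start, sizeof(Row), cmp_y);
			start = i;
		}
	}

	for (int i = 0; i < n; i++)
		printf("(%d, %d)\n", rows[i].x, rows[i].y);
	return 0;
}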

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  178 ++-
 src/backend/optimizer/path/pathkeys.c         |   72 +-
 src/backend/optimizer/plan/createplan.c       |  120 +-
 src/backend/optimizer/plan/planner.c          |   85 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |  134 +-
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |    6 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    1 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4243 insertions(+), 173 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..675059953b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4554,6 +4554,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of increased overhead from splitting the result set into multiple
+    sorting batches.
    </para>
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for a IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32); ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
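+ *
+ *	As a purely hypothetical illustration (table, index, and plan shape
+ *	are invented for this comment), given an index on (a) alone, a query
+ *	such as
+ *
+ *		SELECT * FROM tbl ORDER BY a, b LIMIT 10;
+ *
+ *	could yield a plan like
+ *
+ *		Limit
+ *		  ->  Incremental Sort
+ *		        Sort Key: a, b
+ *		        Presorted Key: a
+ *		        ->  Index Scan using tbl_a_idx on tbl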
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information either in the local node's
+ * sort info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
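+ *
+ * For example, INSTRUMENT_SORT_GROUP(node, fullsort) records the stats of
+ * node->fullsort_state into the fullsortGroupInfo member of whichever info
+ * struct (local or shared) applies.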
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	if (node->ss.ps.instrument != NULL) \
+	{ \
+		if (node->shared_info && node->am_worker) \
+		{ \
+			Assert(IsParallelWorker()); \
+			Assert(ParallelWorkerNumber <= node->shared_info->num_workers); \
+			instrumentSortedGroup(&node->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, node->groupName##_state); \
+		} else { \
+			instrumentSortedGroup(&node->incsort_info.groupName##GroupInfo, node->groupName##_state); \
+		} \
+	}
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
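+ *
+ * For example, if a presorted column is ordered by int4lt (<), the lookup
+ * below resolves int4eq (=) as the matching equality operator and caches
+ * its FmgrInfo along with a reusable FunctionCallInfo.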
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and to minimize the number of function calls.
+	 */
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all of the already-fetched tuples belong to a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
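+ *
+ * As an example, if the full sort state holds tuples with prefix values
+ * (1), (1), (2), (2), (2), the first call here moves the two 1-prefixed
+ * tuples into the prefix sort state and carries the first 2-prefixed tuple
+ * over in the transfer_tuple slot; a later call transfers the remaining
+ * group.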
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * out of its tuples, so this reference is safe. We do need to
+				 * reset the group pivot tuple though since we've finished the
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for the next prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the parent node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, performs an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
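+ *
+ *		Processing moves through four execution statuses:
+ *		INCSORT_LOADFULLSORT (accumulate a minimum-size batch sorted on
+ *		all keys), INCSORT_LOADPREFIXSORT (a large group was detected,
+ *		so sort only the suffix keys), and INCSORT_READFULLSORT /
+ *		INCSORT_READPREFIXSORT while draining the corresponding
+ *		tuplesort.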
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * set up the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort)
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the
+			 * full sort state, we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort)
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that
+				 * it needs to move tuples from the full sort to the presorted
+				 * prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort)
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we keep only one of many
+	 * sort batches in the current sort state.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost.  Also, since preparePresortedCols() is
+	 * only run when the full sort state is null, nulling that pointer here
+	 * would cause the pivot comparator state to be rebuilt and leaked.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..0eef5d7707 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *	  not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,143 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally,
+ * 	when the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
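+ *
+ * As a rough illustrative example (numbers invented for this comment):
+ * with 10000 input tuples estimated to form 100 groups, each per-group
+ * tuplesort is costed at 1.5 * (10000 / 100) = 150 tuples, the startup
+ * cost covers only the first group, and the run cost accumulates the
+ * other 99 groups plus the per-tuple group-detection and per-group
+ * tuplesort-reset overheads added below.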
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group of tuples whose
+	 * presorted keys are equal.  Incremental sort is sensitive to the
+	 * distribution of tuples across groups, for which we rely on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate the average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead by itself. Firstly, it has to
+	 * detect the sort groups. This is roughly equal to one extra copy and
+	 * comparison per tuple. Secondly, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
 
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
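
For readers skimming the patch, here is a standalone sketch of how the startup
and run costs above combine, using hypothetical numbers (the group_* values
stand in for what cost_tuplesort() would report; this is an illustration, not
code from the patch):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical input: split into 100 groups by the presorted keys. */
        double input_groups = 100.0;
        double input_startup_cost = 0.0, input_run_cost = 250.0;
        /* Stand-ins for what cost_tuplesort() would report per group. */
        double group_startup_cost = 5.0, group_run_cost = 1.0;
        double group_input_run_cost = input_run_cost / input_groups;

        /* Startup: sort only the first group, plus its share of the input. */
        double startup = group_startup_cost + input_startup_cost
            + group_input_run_cost;

        /* Run: finish the first group, then process the remaining groups. */
        double run = group_run_cost
            + (group_run_cost + group_startup_cost) * (input_groups - 1)
            + group_input_run_cost * (input_groups - 1);

        printf("startup = %.2f, total = %.2f\n", startup, startup + run);
        return 0;
    }

The key property is visible in the output: startup cost covers a single group
(7.50 here versus 850.00 total), which is what lets incremental sort win under
LIMIT.
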
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_tuplesort(&startup_cost, &run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	startup_cost += input_cost;
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,60 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_count_contained_in
+ *    Same as pathkeys_contained_in, but also sets *n_common to the length
+ *    of the longest common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we ended with a null value, then we've processed the whole list. */
+	*n_common = n;
+	return (key1 == NULL);
+}
+
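The contract here is essentially a longest-common-prefix test. A minimal
sketch of the same contract on plain integer arrays (an illustration only;
real PathKey comparison relies on pointer equality of canonical pathkeys):

    #include <stdbool.h>

    /*
     * Returns true iff keys1 is fully contained in keys2; *n_common receives
     * the length of the longest common prefix of the two arrays.
     */
    static bool
    prefix_count_contained_in(const int *keys1, int n1,
                              const int *keys2, int n2, int *n_common)
    {
        int n = 0;

        while (n < n1 && n < n2 && keys1[n] == keys2[n])
            n++;
        *n_common = n;
        return n == n1;
    }
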
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1840,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we now have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving performance.
+ * Thus we return either 0, if no useful keys are found, or the number of
+ * leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..6d26bfbeb5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5090,6 +5129,12 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
 
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
 	cost_sort(&sort_path, root, NIL,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
@@ -5677,9 +5722,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5741,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6118,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6889,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..aeb83841d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,77 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			/*
+			 * Try adding an explicit sort, but only to the cheapest total path
+			 * since a full sort should generally add the same cost to all
+			 * paths.
+			 */
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/*
+			 * If incremental sort is enabled, then try it as well. Unlike with
+			 * regular sorts, we can't just look at the cheapest path, because
+			 * the cost of incremental sort depends on how well presorted the
+			 * path is. Additionally, incremental sort may enable a path with
+			 * cheaper startup cost to win out despite a higher total cost.
+			 */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..e444aef60a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -779,36 +779,83 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		 * Unless pathkeys are incompatible, see if one of the paths dominates
 		 * the other (both in startup and total cost). It may happen that one
 		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			PathCostComparison costcmp;
-
 			/*
-			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 * It's not entirely obvious that we need to consider startup cost
+			 * only when incremental sort is enabled, but doing so saves us
+			 * ~1% of planning time in some worst-case scenarios. We do have
+			 * to consider startup cost for incremental sort, because
+			 * otherwise a path with lower startup cost but higher total cost
+			 * could be discarded in favor of a path with higher startup cost
+			 * (but lower total cost) before LIMIT optimizations can be
+			 * applied.
 			 */
-			costcmp = compare_path_costs_fuzzily(new_path, old_path,
-												 STD_FUZZ_FACTOR);
-
-			if (costcmp == COSTS_BETTER1)
+			if (enable_incrementalsort)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
-					remove_old = true;
+				PathCostComparison costcmp;
+
+				/*
+				 * Do a fuzzy cost comparison with standard fuzziness limit.
+				 */
+				costcmp = compare_path_costs_fuzzily(new_path, old_path,
+													 STD_FUZZ_FACTOR);
+
+				if (costcmp == COSTS_BETTER1)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+				}
+				else if (costcmp == COSTS_BETTER2)
+				{
+					if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
+				else if (costcmp == COSTS_EQUAL)
+				{
+					if (keyscmp == PATHKEYS_BETTER1)
+						remove_old = true;
+					else if (keyscmp == PATHKEYS_BETTER2)
+						accept_new = false;
+				}
 			}
-			else if (costcmp == COSTS_BETTER2)
+			else if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER2)
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
 					accept_new = false;
 			}
-			else if (costcmp == COSTS_EQUAL)
+			else if (old_path->total_cost >
+					 new_path->total_cost * STD_FUZZ_FACTOR)
 			{
-				if (keyscmp == PATHKEYS_BETTER1)
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
 					remove_old = true;
-				else if (keyscmp == PATHKEYS_BETTER2)
-					accept_new = false;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
 			}
 		}
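
When incremental sort is disabled, the chain above degenerates to the classic
fuzzy total-cost comparison. A condensed sketch of that comparison, assuming
the STD_FUZZ_FACTOR of 1.01 defined in pathnode.c:

    #define STD_FUZZ_FACTOR 1.01

    /*
     * Returns 1 if path 1 is meaningfully cheaper on total cost, 2 if path 2
     * is, and 0 if the two are within the fuzz factor of each other.
     */
    static int
    compare_total_cost_fuzzily(double total_cost1, double total_cost2)
    {
        if (total_cost1 > total_cost2 * STD_FUZZ_FACTOR)
            return 2;
        if (total_cost2 > total_cost1 * STD_FUZZ_FACTOR)
            return 1;
        return 0;
    }
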
 
@@ -2750,6 +2797,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..fe87d549d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -989,6 +989,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..427e5e967e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -358,6 +358,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the overhead of
+ * allocation might possibly be lowered.  However, we don't consider array sizes
+ * less than 1024.
+ *
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
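
For a sense of scale (assuming ALLOCSET_SEPARATE_THRESHOLD is 8192 bytes and
sizeof(SortTuple) is 24 bytes on a typical 64-bit build), the threshold term
works out below the floor, so the initial size is normally 1024:

    #include <stdio.h>

    #define Max(x, y) ((x) > (y) ? (x) : (y))

    int main(void)
    {
        /* 8192 / 24 + 1 = 342, so the Max() clamp leaves us at 1024. */
        printf("INITIAL_MEMTUPSIZE = %d\n", Max(1024, 8192 / 24 + 1));
        return 0;
    }
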
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sort
+								 * of groups, either in-memory or on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
+								 * space, false when it's value for in-memory
+								 * space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The contents
+	 * of this context are released by tuplesort_reset().
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch().
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * Having set up all of the other non-parallel-related state above, we
+	 * now set up the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state. Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent uses).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
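A hypothetical caller illustrating the batch lifecycle this enables (a sketch
only; tuplesort creation and slot handling are elided):

    #include "postgres.h"
    #include "utils/tuplesort.h"

    /* Sort two batches of tuples while reusing a single Tuplesortstate. */
    static void
    sort_two_batches(Tuplesortstate *state, TupleTableSlot *slot)
    {
        /* first batch: load, sort, read back */
        tuplesort_puttupleslot(state, slot);
        tuplesort_performsort(state);
        while (tuplesort_gettupleslot(state, true, false, slot, NULL))
            ;                   /* consume sorted tuples */

        /* keep the state and its metadata, drop only per-batch memory */
        tuplesort_reset(state);

        /* second batch reuses the same state */
        tuplesort_puttupleslot(state, slot);
        tuplesort_performsort(state);
    }
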
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.
+	 * Note that a set of tuples may occupy less space on disk than in
+	 * memory, due to a more compact on-disk representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in place.  After tuplesort_reset, the tuplesort is ready
+ *	to start a new sort.  This avoids recreating tuplesort states (and thereby
+ *	saves resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to the first batch and any subsequent batches.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When performing sorting by multiple keys, it's possible that the input
+ *	 dataset is already sorted on a prefix of those keys. We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..9710e5c0a4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,11 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..bcd08af753 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..ed50092bc7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used in a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a distinct bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 69724d54b9..9ac816177e 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..175c1d5a49 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..2bcd994361 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 331d92708d..f63e71c075 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.17.1

#286Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#285)
5 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote:

On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.

This is another area of the patch I haven't really modified.

See attached for a cleanup of this; it removes cost_full_sort, so
label_sort_with_costsize is back to how it was.

I've directly merged this into the patch series; if you'd like to see
the diff I can send that along.

Thanks. Attached is v54 of the patch, with some minor changes. The main
two changes are in add_partial_path_precheck(), firstly to also consider
startup_cost, as discussed before. The second change (in 0003) is a bit
of an experiment to make add_partial_path_precheck() cheaper by calling
compare_pathkeys only after checking the costs (which should be cheaper
than the function call). add_path_precheck already does it in that order
anyway.

I noticed that compare_path_costs_fuzzily and add_path_precheck both
check consider_startup/consider_param_startup to decide whether to look
at startup_cost. add_partial_path_precheck probably should do that too.
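
For concreteness, here is a rough sketch of the kind of guard that
suggestion implies, written against the post-0002/0003 shape of
add_partial_path_precheck(). This is purely illustrative and not part of
the attached patches:

	/*
	 * Illustrative sketch only: skip the startup_cost comparison unless
	 * the rel actually cares about fast-start plans, mirroring what
	 * add_path_precheck and compare_path_costs_fuzzily already do.
	 * Partial paths are never parameterized, so consider_startup is
	 * presumably the relevant flag here.
	 */
	bool		startup_matters = parent_rel->consider_startup;

	if ((total_cost > old_path->total_cost * STD_FUZZ_FACTOR) &&
		(!startup_matters ||
		 startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR))
	{
		/* dominated on cost; pathkeys must be strictly better to survive */
		if (compare_pathkeys(pathkeys, old_path->pathkeys) != PATHKEYS_BETTER1)
			return false;
	}

Whether consider_param_startup would also need to be checked (as
compare_path_costs_fuzzily effectively does for parameterized paths) is
left open here.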

Right now I'm running a battery of benchmarks to see if/how this affects
planner performance. Initially the results were rather noisy, but after
pinning the processes to CPUs (using taskset) and fixing the CPU
frequency (using cpupower) it's much better. The intermediate results
seem pretty fine (they are within 0.5% of master, in both directions).
I'll share the final results.

Overall, I think this is pretty close to committable, and I'm planning
to get it committed on Monday unless someone objects.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v54-0001-Consider-low-startup-cost-when-adding-partial-path.patchtext/plain; charset=us-asciiDownload
From 761b935584229243ecc6fd47d83e86d6b1b382c7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:55:54 +0200
Subject: [PATCH 1/5] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the
comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds, and, as a result, a higher cost plan
ends up being chosen because a low startup cost partial path is ignored
in favor of a lower total cost partial path, and a limit is applied on
top of that which would normally favor the lower startup cost plan.
---
 src/backend/optimizer/util/pathnode.c | 65 +++++++++++++--------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..b570bfd3be 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel,
  *
  *	  Because we don't consider parameterized paths here, we also don't
  *	  need to consider the row counts as a measure of quality: every path will
- *	  produce the same number of rows.  Neither do we need to consider startup
- *	  costs: parallelism is only used for plans that will be run to completion.
- *	  Therefore, this routine is much simpler than add_path: it needs to
- *	  consider only pathkeys and total cost.
+ *	  produce the same number of rows.  It may however matter how much the
+ *	  path ordering matches the final ordering, needed by upper parts of the
+ *	  plan. Because that will affect how expensive the incremental sort is,
+ *	  we need to consider both the total and startup path, in addition to
+ *	  we need to consider both the total and startup cost, in addition to
  *
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
@@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		/* Compare pathkeys. */
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
-		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		/*
+		 * Unless pathkeys are incompatible, see if one of the paths dominates
+		 * the other (both in startup and total cost). It may happen that one
+		 * path has lower startup cost, the other has lower total cost.
+		 *
+		 * XXX Perhaps we could do this only when incremental sort is enabled,
+		 * and use the simpler version (comparing just total cost) otherwise?
+		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-			{
-				/* New path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER1)
-					accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost
-					 * STD_FUZZ_FACTOR)
+			PathCostComparison costcmp;
+
+			/*
+			 * Do a fuzzy cost comparison with standard fuzziness limit.
+			 */
+			costcmp = compare_path_costs_fuzzily(new_path, old_path,
+												 STD_FUZZ_FACTOR);
+
+			if (costcmp == COSTS_BETTER1)
 			{
-				/* Old path costs more; keep it only if pathkeys are better. */
-				if (keyscmp != PATHKEYS_BETTER2)
+				if (keyscmp == PATHKEYS_BETTER1)
 					remove_old = true;
 			}
-			else if (keyscmp == PATHKEYS_BETTER1)
+			else if (costcmp == COSTS_BETTER2)
 			{
-				/* Costs are about the same, new path has better pathkeys. */
-				remove_old = true;
-			}
-			else if (keyscmp == PATHKEYS_BETTER2)
-			{
-				/* Costs are about the same, old path has better pathkeys. */
-				accept_new = false;
-			}
-			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-			{
-				/* Pathkeys are the same, and the old path costs more. */
-				remove_old = true;
+				if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
-			else
+			else if (costcmp == COSTS_EQUAL)
 			{
-				/*
-				 * Pathkeys are the same, and new path isn't materially
-				 * cheaper.
-				 */
-				accept_new = false;
+				if (keyscmp == PATHKEYS_BETTER1)
+					remove_old = true;
+				else if (keyscmp == PATHKEYS_BETTER2)
+					accept_new = false;
 			}
 		}
 
-- 
2.21.1

v54-0002-rework-add_partial_path_precheck-too.patchtext/plain; charset=us-asciiDownload
From 060234426851cb8f815fb873ab5aaf33b3830143 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Fri, 3 Apr 2020 18:26:29 +0200
Subject: [PATCH 2/5] rework add_partial_path_precheck too

---
 src/backend/optimizer/path/joinpath.c | 12 ++++++++---
 src/backend/optimizer/util/pathnode.c | 31 ++++++++++++---------------
 src/include/optimizer/pathnode.h      |  3 ++-
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6ba2e..1d0c3e8027 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -515,7 +515,9 @@ try_partial_nestloop_path(PlannerInfo *root,
 	 */
 	initial_cost_nestloop(root, &workspace, jointype,
 						  outer_path, inner_path, extra);
-	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+	if (!add_partial_path_precheck(joinrel,
+								   workspace.startup_cost, workspace.total_cost,
+								   pathkeys))
 		return;
 
 	/*
@@ -693,7 +695,9 @@ try_partial_mergejoin_path(PlannerInfo *root,
 						   outersortkeys, innersortkeys,
 						   extra);
 
-	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+	if (!add_partial_path_precheck(joinrel,
+								   workspace.startup_cost, workspace.total_cost,
+								   pathkeys))
 		return;
 
 	/* Might be good enough to be worth trying, so let's try it. */
@@ -817,7 +821,9 @@ try_partial_hashjoin_path(PlannerInfo *root,
 	 */
 	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
 						  outer_path, inner_path, extra, parallel_hash);
-	if (!add_partial_path_precheck(joinrel, workspace.total_cost, NIL))
+	if (!add_partial_path_precheck(joinrel,
+								   workspace.startup_cost, workspace.total_cost,
+								   NIL))
 		return;
 
 	/* Might be good enough to be worth trying, so let's try it. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b570bfd3be..7211fc35fd 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -854,15 +854,14 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
  * add_partial_path_precheck
  *	  Check whether a proposed new partial path could possibly get accepted.
  *
- * Unlike add_path_precheck, we can ignore startup cost and parameterization,
- * since they don't matter for partial paths (see add_partial_path).  But
- * we do want to make sure we don't add a partial path if there's already
- * a complete path that dominates it, since in that case the proposed path
- * is surely a loser.
+ * Unlike add_path_precheck, we can ignore parameterization, since it doesn't
+ * matter for partial paths (see add_partial_path).  But we do want to make
+ * sure we don't add a partial path if there's already a complete path that
+ * dominates it, since in that case the proposed path is surely a loser.
  */
 bool
-add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
-						  List *pathkeys)
+add_partial_path_precheck(RelOptInfo *parent_rel, Cost startup_cost,
+						  Cost total_cost, List *pathkeys)
 {
 	ListCell   *p1;
 
@@ -885,11 +884,14 @@ add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
 		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
-			if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
-				keyscmp != PATHKEYS_BETTER1)
+			if ((startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR) &&
+				(total_cost > old_path->total_cost * STD_FUZZ_FACTOR) &&
+				(keyscmp != PATHKEYS_BETTER1))
 				return false;
-			if (old_path->total_cost > total_cost * STD_FUZZ_FACTOR &&
-				keyscmp != PATHKEYS_BETTER2)
+
+			if ((old_path->startup_cost > startup_cost * STD_FUZZ_FACTOR) &&
+				(old_path->total_cost > total_cost * STD_FUZZ_FACTOR) &&
+				(keyscmp != PATHKEYS_BETTER2))
 				return true;
 		}
 	}
@@ -899,13 +901,8 @@ add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
 	 * clearly good enough that it might replace one.  Compare it to
 	 * non-parallel plans.  If it loses even before accounting for the cost of
 	 * the Gather node, we should definitely reject it.
-	 *
-	 * Note that we pass the total_cost to add_path_precheck twice.  This is
-	 * because it's never advantageous to consider the startup cost of a
-	 * partial path; the resulting plans, if run in parallel, will be run to
-	 * completion.
 	 */
-	if (!add_path_precheck(parent_rel, total_cost, total_cost, pathkeys,
+	if (!add_path_precheck(parent_rel, startup_cost, total_cost, pathkeys,
 						   NULL))
 		return false;
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..e73c5637cc 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,7 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 							  List *pathkeys, Relids required_outer);
 extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
-									  Cost total_cost, List *pathkeys);
+									  Cost startup_cost, Cost total_cost,
+									  List *pathkeys);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 								 Relids required_outer, int parallel_workers);
-- 
2.21.1

v54-0003-rework-add_partial_path_precheck-check-costs-first.patchtext/plain; charset=us-asciiDownload
From e5dc0ccd72cd4dce25de22982a1d950ae73f1f6a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Fri, 3 Apr 2020 18:43:50 +0200
Subject: [PATCH 3/5] rework add_partial_path_precheck - check costs first

---
 src/backend/optimizer/util/pathnode.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7211fc35fd..4e798b801a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -880,18 +880,25 @@ add_partial_path_precheck(RelOptInfo *parent_rel, Cost startup_cost,
 	{
 		Path	   *old_path = (Path *) lfirst(p1);
 		PathKeysComparison keyscmp;
+		bool		compared = false;
 
-		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
-		if (keyscmp != PATHKEYS_DIFFERENT)
+		if ((startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR) &&
+			(total_cost > old_path->total_cost * STD_FUZZ_FACTOR))
 		{
-			if ((startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR) &&
-				(total_cost > old_path->total_cost * STD_FUZZ_FACTOR) &&
-				(keyscmp != PATHKEYS_BETTER1))
+			keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
+			compared = true;
+
+			if (keyscmp != PATHKEYS_BETTER1)
 				return false;
+		}
+
+		if ((old_path->startup_cost > startup_cost * STD_FUZZ_FACTOR) &&
+			(old_path->total_cost > total_cost * STD_FUZZ_FACTOR))
+		{
+			if (!compared)
+				keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
 
-			if ((old_path->startup_cost > startup_cost * STD_FUZZ_FACTOR) &&
-				(old_path->total_cost > total_cost * STD_FUZZ_FACTOR) &&
-				(keyscmp != PATHKEYS_BETTER2))
+			if (keyscmp != PATHKEYS_BETTER2)
 				return true;
 		}
 	}
-- 
2.21.1

v54-0004-Implement-incremental-sort.patchtext/plain; charset=us-asciiDownload
From 5bf673fda0f4254367da8a44498fdb2324cf3b8d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 15:25:55 +0100
Subject: [PATCH 4/5] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.

Co-authored-by: Alexander Korotkov <a.korotkov@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 doc/src/sgml/perform.sgml                     |   42 +-
 src/backend/commands/explain.c                |  239 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   14 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   34 +
 src/backend/executor/nodeIncrementalSort.c    | 1263 +++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  178 ++-
 src/backend/optimizer/path/pathkeys.c         |   72 +-
 src/backend/optimizer/plan/createplan.c       |  120 +-
 src/backend/optimizer/plan/planner.c          |   85 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   62 +-
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |  306 +++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   28 +
 src/include/nodes/execnodes.h                 |   80 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   10 +
 src/include/optimizer/cost.h                  |    6 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    1 +
 src/include/utils/tuplesort.h                 |   16 +-
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1399 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |  194 +++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 41 files changed, 4183 insertions(+), 161 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4d6ed4bbc..07e18e4ac0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4549,6 +4549,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort steps.
+        The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..ee8933861c 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -291,7 +291,47 @@ EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
     often see this plan type for queries that fetch just a single row.  It's
     also often used for queries that have an <literal>ORDER BY</literal> condition
     that matches the index order, because then no extra sorting step is needed
-    to satisfy the <literal>ORDER BY</literal>.
+    to satisfy the <literal>ORDER BY</literal>.  In this example, adding
+    <literal>ORDER BY unique1</literal> would use the same plan because the
+    index already implicitly provides the requested ordering.
+   </para>
+
+   <para>
+     The planner may implement an <literal>ORDER BY</literal> clause in several
+     ways.  The above example shows that such an ordering clause may be
+     implemented implicitly.  The planner may also add an explicit
+     <literal>sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
+                            QUERY PLAN
+-------------------------------------------------------------------
+ Sort  (cost=1109.39..1134.39 rows=10000 width=244)
+   Sort Key: unique1
+   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
+</screen>
+
+    If a part of the plan guarantees an ordering on a prefix of the
+    required sort keys, then the planner may instead decide to use an
+    <literal>incremental sort</literal> step:
+
+<screen>
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+                                              QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ Limit  (cost=521.06..538.05 rows=100 width=244)
+   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
+         Sort Key: four, ten
+         Presorted Key: four
+         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
+</screen>
+
+    Compared to regular sorts, sorting incrementally allows returning tuples
+    before the entire result set has been sorted, which particularly enables
+    optimizations with <literal>LIMIT</literal> queries.  It may also reduce
+    memory usage and the likelihood of spilling sorts to disk, but it comes at
+    the cost of the increased overhead of splitting the result set into multiple
+    sorting batches.
    </para>
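+
+   <para>
+    To judge whether an incremental sort helps a particular query, it can be
+    useful to compare plans with the strategy disabled via the
+    <varname>enable_incrementalsort</varname> configuration parameter:
+
+<programlisting>
+SET enable_incrementalsort = off;
+EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;
+SET enable_incrementalsort = on;
+</programlisting>
+   </para>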
 
    <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..8aa45a719c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -82,6 +82,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
 							ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
 						   ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+									   List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 								   ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -95,7 +97,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 							ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-								 int nkeys, AttrNumber *keycols,
+								 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 								 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 								 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -103,6 +105,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 							 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+									   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
@@ -1240,6 +1244,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Sort:
 			pname = sname = "Sort";
 			break;
+		case T_IncrementalSort:
+			pname = sname = "Incremental Sort";
+			break;
 		case T_Group:
 			pname = sname = "Group";
 			break;
@@ -1899,6 +1906,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			show_sort_keys(castNode(SortState, planstate), ancestors, es);
 			show_sort_info(castNode(SortState, planstate), es);
 			break;
+		case T_IncrementalSort:
+			show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+									   ancestors, es);
+			show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+									   es);
+			break;
 		case T_MergeAppend:
 			show_merge_append_keys(castNode(MergeAppendState, planstate),
 								   ancestors, es);
@@ -2227,12 +2240,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+						   List *ancestors, ExplainState *es)
+{
+	IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+	show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+						 plan->sort.numCols, plan->nPresortedCols,
+						 plan->sort.sortColIdx,
+						 plan->sort.sortOperators, plan->sort.collations,
+						 plan->sort.nullsFirst,
+						 ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2243,7 +2273,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 plan->sortOperators, plan->collations,
 						 plan->nullsFirst,
 						 ancestors, es);
@@ -2267,7 +2297,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
 		else
 			show_sort_group_keys(outerPlanState(astate), "Group Key",
-								 plan->numCols, plan->grpColIdx,
+								 plan->numCols, 0, plan->grpColIdx,
 								 NULL, NULL, NULL,
 								 ancestors, es);
 
@@ -2336,7 +2366,7 @@ show_grouping_set_keys(PlanState *planstate,
 	if (sortnode)
 	{
 		show_sort_group_keys(planstate, "Sort Key",
-							 sortnode->numCols, sortnode->sortColIdx,
+							 sortnode->numCols, 0, sortnode->sortColIdx,
 							 sortnode->sortOperators, sortnode->collations,
 							 sortnode->nullsFirst,
 							 ancestors, es);
@@ -2393,7 +2423,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(plan, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 NULL, NULL, NULL,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
@@ -2406,13 +2436,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
 	List	   *result = NIL;
+	List	   *resultPresorted = NIL;
 	StringInfoData sortkeybuf;
 	bool		useprefix;
 	int			keyno;
@@ -2452,9 +2483,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 								   nullsFirst[keyno]);
 		/* Emit one property-list item per sort key */
 		result = lappend(result, pstrdup(sortkeybuf.data));
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2668,6 +2703,196 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	}
 }
 
+/*
+ * Incremental sort nodes sort in (a potentially very large number of) batches,
+ * so EXPLAIN ANALYZE needs to roll up the tuplesort stats from each batch into
+ * an intelligible summary.
+ *
+ * This function is used for both a non-parallel node and each worker in a
+ * parallel incremental sort node.
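+ *
+ * In text format the rolled-up stats come out on a single line per sort
+ * type, along these lines (the values here are illustrative only):
+ *
+ *   Full-sort Groups: 4 Sort Method: quicksort Memory: avg=27kB peak=30kB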
+ */
+static void
+show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
+								 const char *groupLabel, bool indent, ExplainState *es)
+{
+	ListCell   *methodCell;
+	List	   *methodNames = NIL;
+
+	/* Generate a list of sort methods used across all groups. */
+	for (int bit = 0; bit < sizeof(bits32) * BITS_PER_BYTE; ++bit)
+	{
+		if (groupInfo->sortMethods & (1 << bit))
+		{
+			TuplesortMethod sortMethod = (1 << bit);
+			const char *methodName;
+
+			methodName = tuplesort_method_name(sortMethod);
+			methodNames = lappend(methodNames, unconstify(char *, methodName));
+		}
+	}
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (indent)
+			appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(es->str, "%s Groups: %ld Sort Method", groupLabel,
+						 groupInfo->groupCount);
+		/* plural/singular based on methodNames size */
+		if (list_length(methodNames) > 1)
+			appendStringInfo(es->str, "s: ");
+		else
+			appendStringInfo(es->str, ": ");
+		foreach(methodCell, methodNames)
+		{
+			appendStringInfo(es->str, "%s", (char *) methodCell->ptr_value);
+			if (foreach_current_index(methodCell) < list_length(methodNames) - 1)
+				appendStringInfo(es->str, ", ");
+		}
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxMemorySpaceUsed);
+		}
+
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+
+			const char *spaceTypeName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			/* Add a semicolon separator only if memory stats were printed. */
+			if (groupInfo->maxMemorySpaceUsed > 0)
+				appendStringInfo(es->str, ";");
+			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+							 spaceTypeName, avgSpace,
+							 groupInfo->maxDiskSpaceUsed);
+		}
+	}
+	else
+	{
+		StringInfoData groupName;
+
+		initStringInfo(&groupName);
+		appendStringInfo(&groupName, "%s Groups", groupLabel);
+		ExplainOpenGroup("Incremental Sort Groups", groupName.data, true, es);
+		ExplainPropertyInteger("Group Count", NULL, groupInfo->groupCount, es);
+
+		ExplainPropertyList("Sort Methods Used", methodNames, es);
+
+		if (groupInfo->maxMemorySpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData memoryName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
+			initStringInfo(&memoryName);
+			appendStringInfo(&memoryName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxMemorySpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", memoryName.data, true, es);
+		}
+		if (groupInfo->maxDiskSpaceUsed > 0)
+		{
+			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
+			const char *spaceTypeName;
+			StringInfoData diskName;
+
+			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
+			initStringInfo(&diskName);
+			appendStringInfo(&diskName, "Sort Space %s", spaceTypeName);
+			ExplainOpenGroup("Sort Space", diskName.data, true, es);
+
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
+			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+								   groupInfo->maxDiskSpaceUsed, es);
+
+			ExplainCloseGroup("Sort Space", diskName.data, true, es);
+		}
+
+		ExplainCloseGroup("Incremental Sort Groups", groupName.data, true, es);
+	}
+}
+
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+						   ExplainState *es)
+{
+	IncrementalSortGroupInfo *fullsortGroupInfo;
+	IncrementalSortGroupInfo *prefixsortGroupInfo;
+
+	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
+
+	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+		return;
+
+	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+	if (prefixsortGroupInfo->groupCount > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			appendStringInfo(es->str, " ");
+		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+	}
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+		appendStringInfo(es->str, "\n");
+
+	if (incrsortstate->shared_info != NULL)
+	{
+		int			n;
+		bool		indent_first_line;
+
+		for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+		{
+			IncrementalSortInfo *incsort_info =
+			&incrsortstate->shared_info->sinfo[n];
+
+			/*
+			 * If a worker hasn't processed any sort groups at all, then exclude
+			 * it from output since it either didn't launch or didn't
+			 * contribute anything meaningful.
+			 */
+			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
+			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+			if (fullsortGroupInfo->groupCount == 0 &&
+				prefixsortGroupInfo->groupCount == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			indent_first_line = es->workers_state == NULL || es->verbose;
+			show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort",
+											 indent_first_line, es);
+			if (prefixsortGroupInfo->groupCount > 0)
+			{
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfo(es->str, " ");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			}
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "\n");
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..f990c6473a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -46,6 +46,7 @@ OBJS = \
 	nodeGroup.o \
 	nodeHash.o \
 	nodeHashjoin.o \
+	nodeIncrementalSort.o \
 	nodeIndexonlyscan.o \
 	nodeIndexscan.o \
 	nodeLimit.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b12aeb3334..e2154ba86a 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecReScanIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecReScanGroup((GroupState *) node);
 			break;
@@ -557,8 +562,17 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 		case T_Material:
 		case T_Sort:
+			/* these don't evaluate tlist */
 			return true;
 
+		case T_IncrementalSort:
+
+			/*
+			 * Unlike full sort, incremental sort keeps only a single group of
+			 * tuples in memory, so it can't scan backwards.
+			 */
+			return false;
+
 		case T_LockRows:
 		case T_Limit:
 			return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..333d4ba1fb 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -282,6 +283,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortEstimate((SortState *) planstate, e->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+			break;
 
 		default:
 			break;
@@ -495,6 +500,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+			break;
 
 		default:
 			break;
@@ -957,6 +966,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
 			break;
 		case T_HashState:
 		case T_SortState:
+		case T_IncrementalSortState:
 			/* these nodes have DSM state, but no reinitialization is required */
 			break;
 
@@ -1017,6 +1027,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SortState:
 			ExecSortRetrieveInstrumentation((SortState *) planstate);
 			break;
+		case T_IncrementalSortState:
+			ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+			break;
 		case T_HashState:
 			ExecHashRetrieveInstrumentation((HashState *) planstate);
 			break;
@@ -1303,6 +1316,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			/* even when not parallel-aware, for EXPLAIN ANALYZE */
 			ExecSortInitializeWorker((SortState *) planstate, pwcxt);
 			break;
+		case T_IncrementalSortState:
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+												pwcxt);
+			break;
 
 		default:
 			break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7b2e84f402..5662e7d742 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												estate, eflags);
 			break;
 
+		case T_IncrementalSort:
+			result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+														   estate, eflags);
+			break;
+
 		case T_Group:
 			result = (PlanState *) ExecInitGroup((Group *) node,
 												 estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSort((SortState *) node);
 			break;
 
+		case T_IncrementalSortState:
+			ExecEndIncrementalSort((IncrementalSortState *) node);
+			break;
+
 		case T_GroupState:
 			ExecEndGroup((GroupState *) node);
 			break;
@@ -839,6 +849,30 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
 			sortState->bound = tuples_needed;
 		}
 	}
+	else if (IsA(child_node, IncrementalSortState))
+	{
+		/*
+		 * If it is an IncrementalSort node, notify it that it can use bounded
+		 * sort.
+		 *
+		 * Note: it is the responsibility of nodeIncrementalSort.c to react
+		 * properly to changes of these parameters.  If we ever redesign this,
+		 * it'd be a good idea to integrate this signaling with the
+		 * parameter-change mechanism.
+		 */
+		IncrementalSortState *sortState = (IncrementalSortState *) child_node;
+
+		if (tuples_needed < 0)
+		{
+			/* make sure flag gets reset if needed upon rescan */
+			sortState->bounded = false;
+		}
+		else
+		{
+			sortState->bounded = true;
+			sortState->bound = tuples_needed;
+		}
+	}
 	else if (IsA(child_node, AppendState))
 	{
 		/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..bcab7c054c
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1263 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *	  Routines to handle incremental sorting of relations.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeIncrementalSort.c
+ *
+ * DESCRIPTION
+ *
+ *	Incremental sort is an optimized variant of multikey sort for cases
+ *	when the input is already sorted by a prefix of the sort keys.  For
+ *	example when a sort by (key1, key2 ... keyN) is requested, and the
+ *	input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *	divide the input into groups where keys (key1, ... keyM) are equal,
+ *	and only sort on the remaining columns.
+ *
+ *	Consider the following example.  We have input tuples consisting of
+ *	two integers (X, Y) already presorted by X, while it's required to
+ *	sort them by both X and Y.  Let input tuples be following.
+ *
+ *	(1, 5)
+ *	(1, 2)
+ *	(2, 9)
+ *	(2, 1)
+ *	(2, 5)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	An incremental sort algorithm would split the input into the following
+ *	groups, which have equal X, and then sort them by Y individually:
+ *
+ *		(1, 5) (1, 2)
+ *		(2, 9) (2, 1) (2, 5)
+ *		(3, 3) (3, 7)
+ *
+ *	After sorting these groups and putting them altogether, we would get
+ *	the following result which is sorted by X and Y, as requested:
+ *
+ *	(1, 2)
+ *	(1, 5)
+ *	(2, 1)
+ *	(2, 5)
+ *	(2, 9)
+ *	(3, 3)
+ *	(3, 7)
+ *
+ *	Incremental sort may be more efficient than plain sort, particularly
+ *	on large datasets, as it reduces the amount of data to sort at once,
+ *	making it more likely it fits into work_mem (eliminating the need to
+ *	spill to disk).  But the main advantage of incremental sort is that
+ *	it can start producing rows early, before sorting the whole dataset,
+ *	which is a significant benefit especially for queries with LIMIT.
+ *
+ *	The algorithm we've implemented here is modified from the theoretical
+ *	base described above by operating in two different modes:
+ *	  - Fetching a minimum number of tuples without checking prefix key
+ *	    group membership and sorting on all columns when safe.
+ *	  - Fetching all tuples for a single prefix key group and sorting on
+ *	    solely the unsorted columns.
+ *	We always begin in the first mode, and employ a heuristic to switch
+ *	into the second mode if we believe it's beneficial.
+ *
+ *	Sorting incrementally can potentially use less memory, avoid fetching
+ *	and sorting all tuples in the dataset, and begin returning tuples
+ *	before the entire result set is available.
+ *
+ *	The hybrid mode approach allows us to optimize for both very small
+ *	groups (where the overhead of a new tuplesort is high) and very large
+ *	groups (where we can lower cost by not having to sort on already sorted
+ *	columns), albeit at some extra cost while switching between modes.
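+ *
+ *	In outline, each batch is produced like this (a simplified sketch; the
+ *	code below also handles bounds, rescans, and instrumentation):
+ *
+ *		load up to DEFAULT_MIN_GROUP_SIZE tuples, sorting on all keys;
+ *		keep loading while the prefix keys still match the pivot tuple;
+ *		if DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples arrive with no prefix
+ *			key change, switch to sorting only the suffix keys;
+ *		sort the batch and return its tuples before starting the next one;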
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * We need to store the instrumentation information in either local node's sort
+ * info or, for a parallel worker process, in the shared info (this avoids
+ * having to additionally memcpy the info from local memory to shared memory
+ * at each instrumentation call). This macro expands to choose the proper sort
+ * state and group info.
+ *
+ * Arguments:
+ * - node: type IncrementalSortState *
+ * - groupName: the token fullsort or prefixsort
+ */
+/*
+ * Wrapped in do/while so the macro is a single statement and safe to use in
+ * unbraced if/else branches; call sites supply the trailing semicolon.
+ */
+#define INSTRUMENT_SORT_GROUP(node, groupName) \
+	do { \
+		if ((node)->ss.ps.instrument != NULL) \
+		{ \
+			if ((node)->shared_info && (node)->am_worker) \
+			{ \
+				Assert(IsParallelWorker()); \
+				Assert(ParallelWorkerNumber <= (node)->shared_info->num_workers); \
+				instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+			else \
+			{ \
+				instrumentSortedGroup(&(node)->incsort_info.groupName##GroupInfo, \
+									  (node)->groupName##_state); \
+			} \
+		} \
+	} while (0)
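+
+/*
+ * For example, INSTRUMENT_SORT_GROUP(node, fullsort); records stats from
+ * node->fullsort_state into node->incsort_info.fullsortGroupInfo, or, in a
+ * parallel worker, into that worker's slot in node->shared_info->sinfo[].
+ */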
+
+/* ----------------------------------------------------------------
+ * instrumentSortedGroup
+ *
+ * Because incremental sort processes (potentially many) sort batches, we need
+ * to capture tuplesort stats each time we finalize a sort state. This summary
+ * data is later used for EXPLAIN ANALYZE output.
+ * ----------------------------------------------------------------
+ */
+static void
+instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
+					  Tuplesortstate *sortState)
+{
+	TuplesortInstrumentation sort_instr;
+	groupInfo->groupCount++;
+
+	tuplesort_get_stats(sortState, &sort_instr);
+
+	/* Calculate total and maximum memory and disk space used. */
+	switch (sort_instr.spaceType)
+	{
+		case SORT_SPACE_TYPE_DISK:
+			groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxDiskSpaceUsed)
+				groupInfo->maxDiskSpaceUsed = sort_instr.spaceUsed;
+
+			break;
+		case SORT_SPACE_TYPE_MEMORY:
+			groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
+			if (sort_instr.spaceUsed > groupInfo->maxMemorySpaceUsed)
+				groupInfo->maxMemorySpaceUsed = sort_instr.spaceUsed;
+
+			break;
+	}
+
+	/* Track each sort method we've used. */
+	groupInfo->sortMethods |= sort_instr.sortMethod;
+}
+
+/* ----------------------------------------------------------------
+ * preparePresortedCols
+ *
+ * Prepare information for presorted_keys comparisons.
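+ *
+ * For example, for a sort key ordered by the default integer "<" operator,
+ * get_equality_op_for_ordering_op() should return the matching integer "="
+ * operator from the same btree operator family; its underlying function is
+ * cached here for the group-boundary checks in isCurrentGroup().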
+ * ----------------------------------------------------------------
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	node->presorted_keys =
+		(PresortedKeyData *) palloc(plannode->nPresortedCols *
+									sizeof(PresortedKeyData));
+
+	/* Pre-cache comparison functions for each pre-sorted key. */
+	for (int i = 0; i < plannode->nPresortedCols; i++)
+	{
+		Oid			equalityOp,
+					equalityFunc;
+		PresortedKeyData *key;
+
+		key = &node->presorted_keys[i];
+		key->attno = plannode->sort.sortColIdx[i];
+
+		equalityOp = get_equality_op_for_ordering_op(plannode->sort.sortOperators[i],
+													 NULL);
+		if (!OidIsValid(equalityOp))
+			elog(ERROR, "missing equality operator for ordering operator %u",
+				 plannode->sort.sortOperators[i]);
+
+		equalityFunc = get_opcode(equalityOp);
+		if (!OidIsValid(equalityFunc))
+			elog(ERROR, "missing function for operator %u", equalityOp);
+
+		/* Lookup the comparison function */
+		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+		/* We can initialize the callinfo just once and re-use it */
+		key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+		InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+								 plannode->sort.collations[i], NULL, NULL);
+		key->fcinfo->args[0].isnull = false;
+		key->fcinfo->args[1].isnull = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ * isCurrentGroup
+ *
+ * Check whether a given tuple belongs to the current sort group by comparing
+ * the presorted column values to the pivot tuple of the current group.
+ * ----------------------------------------------------------------
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+	int			nPresortedCols;
+
+	nPresortedCols = castNode(IncrementalSort, node->ss.ps.plan)->nPresortedCols;
+
+	/*
+	 * That the input is sorted by keys (0, ... n) implies that the tail
+	 * keys are more likely to change. Therefore we do our comparison starting
+	 * from the last pre-sorted column to optimize for early detection of
+	 * inequality and minimize the number of function calls.
+	 */
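+	/*
+	 * For example, with two presorted keys (X, Y) and pivot tuple (1, 5),
+	 * an incoming tuple (1, 7) is detected as a new group with a single
+	 * equality call when we start from Y, whereas starting from X would
+	 * always require two calls.
+	 */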
+	for (int i = nPresortedCols - 1; i >= 0; i--)
+	{
+		Datum		datumA,
+					datumB,
+					result;
+		bool		isnullA,
+					isnullB;
+		AttrNumber	attno = node->presorted_keys[i].attno;
+		PresortedKeyData *key;
+
+		datumA = slot_getattr(pivot, attno, &isnullA);
+		datumB = slot_getattr(tuple, attno, &isnullB);
+
+		/* Special case for NULL-vs-NULL, else use standard comparison */
+		if (isnullA || isnullB)
+		{
+			if (isnullA == isnullB)
+				continue;
+			else
+				return false;
+		}
+
+		key = &node->presorted_keys[i];
+
+		key->fcinfo->args[0].value = datumA;
+		key->fcinfo->args[1].value = datumB;
+
+		/* just for paranoia's sake, we reset isnull each time */
+		key->fcinfo->isnull = false;
+
+		result = FunctionCallInvoke(key->fcinfo);
+
+		/* Check for null result, since caller is clearly not expecting one */
+		if (key->fcinfo->isnull)
+			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+		if (!DatumGetBool(result))
+			return false;
+	}
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ * switchToPresortedPrefixMode
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated to
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the tuples already fetched are part of a single
+ * prefix key group, we also have to handle the possibility that there is at
+ * least one different prefix key group before the large prefix key group.
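+ *
+ * As an illustration, using (X, Y) tuples with presorted key X: if the full
+ * sort state holds (1, 9), (2, 3), and (2, 7), the first call here moves
+ * only (1, 9) into the prefix sort state, sorts and returns that batch, and
+ * a later call comes back for the (2, ...) group.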
+ * ----------------------------------------------------------------
+ */
+static void
+switchToPresortedPrefixMode(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	ScanDirection dir;
+	int64		nTuples = 0;
+	bool		lastTuple = false;
+	bool		firstTuple = true;
+	TupleDesc	tupDesc;
+	PlanState  *outerNode;
+	IncrementalSort *plannode = castNode(IncrementalSort, node->ss.ps.plan);
+
+	dir = node->ss.ps.state->es_direction;
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Configure the prefix sort state the first time around. */
+	if (node->prefixsort_state == NULL)
+	{
+		Tuplesortstate *prefixsort_state;
+		int			nPresortedCols = plannode->nPresortedCols;
+
+		/*
+		 * Optimize the sort by assuming the prefix columns are all equal and
+		 * thus we only need to sort by any remaining columns.
+		 */
+		prefixsort_state = tuplesort_begin_heap(tupDesc,
+												plannode->sort.numCols - nPresortedCols,
+												&(plannode->sort.sortColIdx[nPresortedCols]),
+												&(plannode->sort.sortOperators[nPresortedCols]),
+												&(plannode->sort.collations[nPresortedCols]),
+												&(plannode->sort.nullsFirst[nPresortedCols]),
+												work_mem,
+												NULL,
+												false);
+		node->prefixsort_state = prefixsort_state;
+	}
+	else
+	{
+		/* Next group of presorted data */
+		tuplesort_reset(node->prefixsort_state);
+	}
+
+	/*
+	 * If the current node has a bound, then it's reasonably likely that a
+	 * large prefix key group will benefit from bounded sort, so configure the
+	 * tuplesort to allow for that optimization.
+	 */
+	if (node->bounded)
+	{
+		SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+				   node->bound - node->bound_Done);
+		tuplesort_set_bound(node->prefixsort_state,
+							node->bound - node->bound_Done);
+	}
+
+	/*
+	 * Copy as many tuples as we can (i.e., in the same prefix key group) from
+	 * the full sort state to the prefix sort state.
+	 */
+	for (;;)
+	{
+		lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+		/*
+		 * When we encounter multiple prefix key groups inside the full sort
+		 * tuplesort we have to carry over the last read tuple into the next
+		 * batch.
+		 */
+		if (firstTuple && !TupIsNull(node->transfer_tuple))
+		{
+			tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+			nTuples++;
+
+			/* The carried over tuple is our new group pivot tuple. */
+			ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		}
+		else
+		{
+			tuplesort_gettupleslot(node->fullsort_state,
+								   ScanDirectionIsForward(dir),
+								   false, node->transfer_tuple, NULL);
+
+			/*
+			 * If this is our first time through the loop, then we need to
+			 * save the first tuple we get as our new group pivot.
+			 */
+			if (TupIsNull(node->group_pivot))
+				ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+			if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+				nTuples++;
+			}
+			else
+			{
+				/*
+				 * The tuple isn't part of the current batch so we need to
+				 * carry it over into the next batch of tuples we transfer out
+				 * of the full sort tuplesort into the presorted prefix
+				 * tuplesort. We don't actually have to do anything special to
+				 * save the tuple since we've already loaded it into the
+				 * node->transfer_tuple slot, and, even though that slot
+				 * points to memory inside the full sort tuplesort, we can't
+				 * reset that tuplesort anyway until we've fully transferred
+				 * reset that tuplesort anyway until we've transferred out all
+				 * of its tuples, so this reference is safe. We do need to
+				 * current prefix key group.
+				 */
+				ExecClearTuple(node->group_pivot);
+				break;
+			}
+		}
+
+		firstTuple = false;
+
+		/*
+		 * If we've copied all of the tuples from the full sort state into the
+		 * prefix sort state, then we don't actually know that we've yet found
+		 * the last tuple in that prefix key group until we check the next
+		 * tuple from the outer plan node, so we retain the current group
+		 * pivot tuple for that prefix key group comparison.
+		 */
+		if (lastTuple)
+			break;
+	}
+
+	/*
+	 * Track how many tuples remain in the full sort batch so that we know if
+	 * we need to sort multiple prefix key groups before processing tuples
+	 * remaining in the large single prefix key group we think we've
+	 * encountered.
+	 */
+	SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+	node->n_fullsort_remaining -= nTuples;
+	SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+	if (lastTuple)
+	{
+		/*
+		 * We've confirmed that all tuples remaining in the full sort batch are
+		 * in the same prefix key group and moved all of those tuples into the
+		 * presorted prefix tuplesort. Now we can save our pivot comparison
+		 * tuple and continue fetching tuples from the outer execution node to
+		 * load into the presorted prefix tuplesort.
+		 */
+		ExecCopySlot(node->group_pivot, node->transfer_tuple);
+		SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_LOADPREFIXSORT;
+
+		/*
+		 * Make sure we clear the transfer tuple slot so that next time we
+		 * encounter a large prefix key group we don't incorrectly assume we
+		 * have a tuple carried over from the previous group.
+		 */
+		ExecClearTuple(node->transfer_tuple);
+	}
+	else
+	{
+		/*
+		 * We finished a group but didn't consume all of the tuples from the
+		 * full sort state, so we'll sort this batch, let the outer node read
+		 * out all of those tuples, and then come back around to find another
+		 * batch.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done, Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+	}
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
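+
+/*
+ * With the defaults above we therefore accumulate at least 32 tuples before
+ * checking prefix keys at all, and assume we've hit a large prefix key group
+ * once 64 tuples arrive without a prefix key change.
+ */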
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSort
+ *
+ *		Assuming that the outer subtree returns tuples presorted by some
+ *		prefix of the target sort columns, perform an incremental sort.
+ *
+ *		Conditions:
+ *		  -- none.
+ *
+ *		Initial States:
+ *		  -- the outer child is prepared to return the first tuple.
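+ *
+ *		The node cycles through four execution states (summarizing the
+ *		logic below):
+ *		  INCSORT_LOADFULLSORT   -- accumulate a batch, sorting on all keys
+ *		  INCSORT_READFULLSORT   -- return tuples from the full sort state
+ *		  INCSORT_LOADPREFIXSORT -- accumulate one large prefix key group
+ *		  INCSORT_READPREFIXSORT -- return tuples from the prefix sort state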
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+	IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+	EState	   *estate;
+	ScanDirection dir;
+	Tuplesortstate *read_sortstate;
+	Tuplesortstate *fullsort_state;
+	TupleTableSlot *slot;
+	IncrementalSort *plannode = (IncrementalSort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int64		nTuples = 0;
+	int64		minGroupSize;
+
+	CHECK_FOR_INTERRUPTS();
+
+	estate = node->ss.ps.state;
+	dir = estate->es_direction;
+	fullsort_state = node->fullsort_state;
+
+	/*
+	 * If a previous iteration has sorted a batch, then we need to check to
+	 * see if there are any remaining tuples in that batch that we can return
+	 * before moving on to other execution states.
+	 */
+	if (node->execution_status == INCSORT_READFULLSORT
+		|| node->execution_status == INCSORT_READPREFIXSORT)
+	{
+		/*
+		 * Return next tuple from the current sorted group set if available.
+		 */
+		read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+			fullsort_state : node->prefixsort_state;
+		slot = node->ss.ps.ps_ResultTupleSlot;
+
+		/*
+		 * We have to populate the slot from the tuplesort before checking
+		 * outerNodeDone because it will set the slot to NULL if no more
+		 * tuples remain. If the tuplesort is empty, but we don't have any
+		 * more tuples available for sort from the outer node, then
+		 * outerNodeDone will have been set so we'll return that now-empty
+		 * slot to the caller.
+		 */
+		if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								   false, slot, NULL) || node->outerNodeDone)
+
+			/*
+			 * Note: there isn't a good test case for the node->outerNodeDone
+			 * check directly, but we need it for any plan where the outer
+			 * node will fail when trying to fetch too many tuples.
+			 */
+			return slot;
+		else if (node->n_fullsort_remaining > 0)
+		{
+			/*
+			 * When we transition to presorted prefix mode, we might have
+			 * accumulated at least one additional prefix key group in the
+			 * full sort tuplesort. The first call to
+			 * switchToPresortedPrefixMode() will have pulled the first one of
+			 * those groups out, and we've returned those tuples to the parent
+			 * node, but if at this point we still have tuples remaining in
+			 * the full sort state (i.e., n_fullsort_remaining > 0), then we
+			 * need to re-execute the prefix mode transition function to pull
+			 * out the next prefix key group.
+			 */
+			SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+					   node->n_fullsort_remaining);
+			switchToPresortedPrefixMode(pstate);
+		}
+		else
+		{
+			/*
+			 * If we don't have any sorted tuples to read and we're not
+			 * currently transitioning into presorted prefix sort mode, then
+			 * it's time to start the process all over again by building a new
+			 * group in the full sort state.
+			 */
+			SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining == 0)\n");
+			node->execution_status = INCSORT_LOADFULLSORT;
+		}
+	}
+
+	/*
+	 * Scan the subplan in the forward direction while creating the sorted
+	 * data.
+	 */
+	estate->es_direction = ForwardScanDirection;
+
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
+
+	/* Load tuples into the full sort state. */
+	if (node->execution_status == INCSORT_LOADFULLSORT)
+	{
+		/*
+		 * Initialize sorting structures.
+		 */
+		if (fullsort_state == NULL)
+		{
+			/*
+			 * Initialize presorted column support structures for
+			 * isCurrentGroup(). It's correct to do this along with the
+			 * initial initialization for the full sort state (and not for the
+			 * prefix sort state) since we always load the full sort state
+			 * first.
+			 */
+			preparePresortedCols(node);
+
+			/*
+			 * Since we optimize small prefix key groups by accumulating a
+			 * minimum number of tuples before sorting, we can't assume that a
+			 * group of tuples all have the same prefix key values. Hence we
+			 * setup the full sort tuplesort to sort by all requested sort
+			 * keys.
+			 */
+			fullsort_state = tuplesort_begin_heap(tupDesc,
+												  plannode->sort.numCols,
+												  plannode->sort.sortColIdx,
+												  plannode->sort.sortOperators,
+												  plannode->sort.collations,
+												  plannode->sort.nullsFirst,
+												  work_mem,
+												  NULL,
+												  false);
+			node->fullsort_state = fullsort_state;
+		}
+		else
+		{
+			/* Reset sort for the next batch. */
+			tuplesort_reset(fullsort_state);
+		}
+
+		/*
+		 * Calculate the remaining tuples left if bounded and configure both
+		 * bounded sort and the minimum group size accordingly.
+		 */
+		if (node->bounded)
+		{
+			int64		currentBound = node->bound - node->bound_Done;
+
+			/*
+			 * Bounded sort isn't likely to be a useful optimization for full
+			 * sort mode since we limit full sort mode to a relatively small
+			 * number of tuples and tuplesort doesn't switch over to top-n
+			 * heap sort anyway unless it hits (2 * bound) tuples.
+			 */
+			if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+				tuplesort_set_bound(fullsort_state, currentBound);
+
+			minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+		}
+		else
+			minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+		/*
+		 * Because we have to read the next tuple to find out that we've
+		 * encountered a new prefix key group, on subsequent groups we have to
+		 * carry over that extra tuple and add it to the new group's sort here
+		 * before we read any new tuples from the outer node.
+		 */
+		if (!TupIsNull(node->group_pivot))
+		{
+			tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+			nTuples++;
+
+			/*
+			 * We're in full sort mode accumulating a minimum number of tuples
+			 * and not checking for prefix key equality yet, so we can't
+			 * assume the group pivot tuple will remain the same -- unless
+			 * we're using a minimum group size of 1, in which case the pivot
+			 * is obviously still the pivot.
+			 */
+			if (nTuples != minGroupSize)
+				ExecClearTuple(node->group_pivot);
+		}
+
+		/*
+		 * Pull as many tuples from the outer node as possible given our
+		 * current operating mode.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If the outer node can't provide us any more tuples, then we can
+			 * sort the current group and return those tuples.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+
+				SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+				node->execution_status = INCSORT_READFULLSORT;
+				break;
+			}
+
+			/* Accumulate the next group of presorted tuples. */
+			if (nTuples < minGroupSize)
+			{
+				/*
+				 * If we haven't yet hit our target minimum group size, then
+				 * we don't need to bother checking for inclusion in the
+				 * current prefix group since at this point we'll assume that
+				 * we'll full sort this batch to avoid a large number of very
+				 * tiny (and thus inefficient) sorts.
+				 */
+				tuplesort_puttupleslot(fullsort_state, slot);
+				nTuples++;
+
+				/*
+				 * If we've reached our minimum group size, then we need to
+				 * store the most recent tuple as a pivot.
+				 */
+				if (nTuples == minGroupSize)
+					ExecCopySlot(node->group_pivot, slot);
+			}
+			else
+			{
+				/*
+				 * If we've already accumulated enough tuples to reach our
+				 * minimum group size, then we need to compare any additional
+				 * tuples to our pivot tuple to see if we reach the end of
+				 * that prefix key group. Only after we find changed prefix
+				 * keys can we guarantee sort stability of the tuples we've
+				 * already accumulated.
+				 */
+				if (isCurrentGroup(node, node->group_pivot, slot))
+				{
+					/*
+					 * As long as the prefix keys match the pivot tuple then
+					 * load the tuple into the tuplesort.
+					 */
+					tuplesort_puttupleslot(fullsort_state, slot);
+					nTuples++;
+				}
+				else
+				{
+					/*
+					 * Since the tuple we fetched isn't part of the current
+					 * prefix key group we don't want to sort it as part of
+					 * the current batch. Instead we use the group_pivot slot
+					 * to carry it over to the next batch (even though we
+					 * won't actually treat it as a group pivot).
+					 */
+					ExecCopySlot(node->group_pivot, slot);
+
+					if (node->bounded)
+					{
+						/*
+						 * If the current node has a bound, and we've already
+						 * sorted n tuples, then the functional bound
+						 * remaining is (original bound - n), so store the
+						 * current number of processed tuples for later use
+						 * configuring the sort state's bound.
+						 */
+						SO2_printf("Changing bound_Done from %ld to %ld\n",
+								   node->bound_Done,
+								   Min(node->bound, node->bound_Done + nTuples));
+						node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+					}
+
+					/*
+					 * Once we find changed prefix keys we can complete the
+					 * sort and transition modes to reading out the sorted
+					 * tuples.
+					 */
+					SO1_printf("Sorting fullsort tuplesort with %ld tuples\n",
+							   nTuples);
+					tuplesort_performsort(fullsort_state);
+
+					INSTRUMENT_SORT_GROUP(node, fullsort);
+
+					SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+					node->execution_status = INCSORT_READFULLSORT;
+					break;
+				}
+			}
+
+			/*
+			 * Unless we've already transitioned modes to reading from the full
+			 * sort state, then we assume that having read at least
+			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
+			 * processing a large group of tuples all having equal prefix keys
+			 * (but haven't yet found the final tuple in that prefix key
+			 * group), so we need to transition in to presorted prefix mode.
+			 */
+			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+				node->execution_status != INCSORT_READFULLSORT)
+			{
+				/*
+				 * The group pivot we have stored has already been put into
+				 * the tuplesort; we don't want to carry it over. Since we
+				 * haven't yet found the end of the prefix key group, it might
+				 * seem like we should keep this, but we don't actually know
+				 * how many prefix key groups might be represented in the full
+				 * sort state, so we'll let the mode transition function
+				 * manage this state for us.
+				 */
+				ExecClearTuple(node->group_pivot);
+
+				/*
+				 * Unfortunately the tuplesort API doesn't include a way to
+				 * retrieve tuples unless a sort has been performed, so we
+				 * perform the sort even though we could just as easily rely
+				 * on FIFO retrieval semantics when transferring them to the
+				 * presorted prefix tuplesort.
+				 */
+				SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+				tuplesort_performsort(fullsort_state);
+
+				INSTRUMENT_SORT_GROUP(node, fullsort);
+
+				/*
+				 * If the full sort tuplesort happened to switch into top-n
+				 * heapsort mode then we will only be able to retrieve
+				 * currentBound tuples (since the tuplesort will have only
+				 * retained the top-n tuples). This is safe even though we
+				 * haven't yet completed fetching the current prefix key group
+				 * because the tuples we've "lost" already sorted "below" the
+				 * retained ones, and we're already contractually guaranteed
+				 * to not need any more than the currentBound tuples.
+				 */
+				if (tuplesort_used_bound(node->fullsort_state))
+				{
+					int64		currentBound = node->bound - node->bound_Done;
+
+					SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+							   nTuples, Min(currentBound, nTuples));
+					nTuples = Min(currentBound, nTuples);
+				}
+
+				SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+						   nTuples);
+
+				/*
+				 * We might have multiple prefix key groups in the full sort
+				 * state, so the mode transition function needs to know that
+				 * it needs to move tuples from the full sort to the presorted
+				 * prefix sort.
+				 */
+				node->n_fullsort_remaining = nTuples;
+
+				/* Transition the tuples to the presorted prefix tuplesort. */
+				switchToPresortedPrefixMode(pstate);
+
+				/*
+				 * Since we know we had tuples to move to the presorted prefix
+				 * tuplesort, we know that unless that transition has verified
+				 * that all tuples belonged to the same prefix key group (in
+				 * which case we can go straight to continuing to load tuples
+				 * into that tuplesort), we should have a tuple to return
+				 * here.
+				 *
+				 * Either way, the appropriate execution status should have
+				 * been set by switchToPresortedPrefixMode(), so we can drop
+				 * out of the loop here and let the appropriate path kick in.
+				 */
+				break;
+			}
+		}
+	}
+
+	if (node->execution_status == INCSORT_LOADPREFIXSORT)
+	{
+		/*
+		 * We only enter this state after the mode transition function has
+		 * confirmed all remaining tuples from the full sort state have the
+		 * same prefix and moved those tuples to the prefix sort state. That
+		 * function has also set a group pivot tuple (which doesn't need to be
+		 * carried over; it's already been put into the prefix sort state).
+		 */
+		Assert(!TupIsNull(node->group_pivot));
+
+		/*
+		 * Read tuples from the outer node and load them into the prefix sort
+		 * state until we encounter a tuple whose prefix keys don't match the
+		 * current group_pivot tuple, since we can't guarantee sort stability
+		 * until we have all tuples matching those prefix keys.
+		 */
+		for (;;)
+		{
+			slot = ExecProcNode(outerNode);
+
+			/*
+			 * If we've exhausted tuples from the outer node we're done
+			 * loading the prefix sort state.
+			 */
+			if (TupIsNull(slot))
+			{
+				/*
+				 * We need to know later if the outer node has completed to be
+				 * able to distinguish between being done with a batch and
+				 * being done with the whole node.
+				 */
+				node->outerNodeDone = true;
+				break;
+			}
+
+			/*
+			 * If the tuple's prefix keys match our pivot tuple, we're not
+			 * done yet and can load it into the prefix sort state. If not, we
+			 * don't want to sort it as part of the current batch. Instead we
+			 * use the group_pivot slot to carry it over to the next batch
+			 * (even though we won't actually treat it as a group pivot).
+			 */
+			if (isCurrentGroup(node, node->group_pivot, slot))
+			{
+				tuplesort_puttupleslot(node->prefixsort_state, slot);
+				nTuples++;
+			}
+			else
+			{
+				ExecCopySlot(node->group_pivot, slot);
+				break;
+			}
+		}
+
+		/*
+		 * Perform the sort and begin returning the tuples to the parent plan
+		 * node.
+		 */
+		SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+		tuplesort_performsort(node->prefixsort_state);
+
+		INSTRUMENT_SORT_GROUP(node, prefixsort);
+
+		SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+		node->execution_status = INCSORT_READPREFIXSORT;
+
+		if (node->bounded)
+		{
+			/*
+			 * If the current node has a bound, and we've already sorted n
+			 * tuples, then the functional bound remaining is (original bound
+			 * - n), so store the current number of processed tuples for use
+			 * in configuring sorting bound.
+			 */
+			SO2_printf("Changing bound_Done from %ld to %ld\n",
+					   node->bound_Done,
+					   Min(node->bound, node->bound_Done + nTuples));
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+		}
+	}
+
+	/* Restore to user specified direction. */
+	estate->es_direction = dir;
+
+	/*
+	 * Get the first or next tuple from tuplesort. Returns NULL if no more
+	 * tuples.
+	 */
+	read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+		fullsort_state : node->prefixsort_state;
+	slot = node->ss.ps.ps_ResultTupleSlot;
+	(void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+								  false, slot, NULL);
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitIncrementalSort
+ *
+ *		Creates the run-time state information for the sort node
+ *		produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+	IncrementalSortState *incrsortstate;
+
+	SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+	/*
+	 * Incremental sort can't be used with EXEC_FLAG_BACKWARD or
+	 * EXEC_FLAG_MARK, because the current sort state contains only one of
+	 * many sort batches rather than the full result set.
+	 */
+	Assert((eflags & (EXEC_FLAG_BACKWARD |
+					  EXEC_FLAG_MARK)) == 0);
+
+	/* Initialize state structure. */
+	incrsortstate = makeNode(IncrementalSortState);
+	incrsortstate->ss.ps.plan = (Plan *) node;
+	incrsortstate->ss.ps.state = estate;
+	incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+	incrsortstate->execution_status = INCSORT_LOADFULLSORT;
+	incrsortstate->bounded = false;
+	incrsortstate->outerNodeDone = false;
+	incrsortstate->bound_Done = 0;
+	incrsortstate->fullsort_state = NULL;
+	incrsortstate->prefixsort_state = NULL;
+	incrsortstate->group_pivot = NULL;
+	incrsortstate->transfer_tuple = NULL;
+	incrsortstate->n_fullsort_remaining = 0;
+	incrsortstate->presorted_keys = NULL;
+
+	if (incrsortstate->ss.ps.instrument != NULL)
+	{
+		IncrementalSortGroupInfo *fullsortGroupInfo =
+		&incrsortstate->incsort_info.fullsortGroupInfo;
+		IncrementalSortGroupInfo *prefixsortGroupInfo =
+		&incrsortstate->incsort_info.prefixsortGroupInfo;
+
+		fullsortGroupInfo->groupCount = 0;
+		fullsortGroupInfo->maxDiskSpaceUsed = 0;
+		fullsortGroupInfo->totalDiskSpaceUsed = 0;
+		fullsortGroupInfo->maxMemorySpaceUsed = 0;
+		fullsortGroupInfo->totalMemorySpaceUsed = 0;
+		fullsortGroupInfo->sortMethods = 0;
+		prefixsortGroupInfo->groupCount = 0;
+		prefixsortGroupInfo->maxDiskSpaceUsed = 0;
+		prefixsortGroupInfo->totalDiskSpaceUsed = 0;
+		prefixsortGroupInfo->maxMemorySpaceUsed = 0;
+		prefixsortGroupInfo->totalMemorySpaceUsed = 0;
+		prefixsortGroupInfo->sortMethods = 0;
+	}
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * Sort nodes don't initialize their ExprContexts because they never call
+	 * ExecQual or ExecProject.
+	 */
+
+	/*
+	 * Initialize child nodes.
+	 *
+	 * We shield the child node from the need to support REWIND, BACKWARD, or
+	 * MARK/RESTORE.
+	 */
+	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+	outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+	/*
+	 * Initialize scan slot and type.
+	 */
+	ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+	/*
+	 * Initialize return slot and type. No need to initialize projection info
+	 * because we don't do any projections.
+	 */
+	ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+	incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+	/*
+	 * Initialize standalone slots to store a tuple for pivot prefix keys and
+	 * for carrying over a tuple from one batch to the next.
+	 */
+	incrsortstate->group_pivot =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+	incrsortstate->transfer_tuple =
+		MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(incrsortstate)),
+								 &TTSOpsMinimalTuple);
+
+	SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+	return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+	SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+	/* clean out the scan tuple */
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	/* must drop standalone tuple slots from outer node */
+	ExecDropSingleTupleTableSlot(node->group_pivot);
+	ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+	/*
+	 * Release tuplesort resources.
+	 */
+	if (node->fullsort_state != NULL)
+	{
+		tuplesort_end(node->fullsort_state);
+		node->fullsort_state = NULL;
+	}
+	if (node->prefixsort_state != NULL)
+	{
+		tuplesort_end(node->prefixsort_state);
+		node->prefixsort_state = NULL;
+	}
+
+	/*
+	 * Shut down the subplan.
+	 */
+	ExecEndNode(outerPlanState(node));
+
+	SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+	PlanState  *outerPlan = outerPlanState(node);
+
+	/*
+	 * Incremental sort doesn't support efficient rescan even when parameters
+	 * haven't changed (e.g., rewind) because unlike regular sort we don't
+	 * store all tuples at once for the full sort.
+	 *
+	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
+	 * reexecute the sort along with the child node below us.
+	 *
+	 * In theory if we've only filled the full sort with one batch (and haven't
+	 * reset it for a new batch yet) then we could efficiently rewind, but
+	 * that seems a narrow enough case that it's not worth handling specially
+	 * at this time.
+	 */
+
+	/* must drop pointer to sort result tuple */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	if (node->group_pivot != NULL)
+		ExecClearTuple(node->group_pivot);
+	if (node->transfer_tuple != NULL)
+		ExecClearTuple(node->transfer_tuple);
+
+	node->bounded = false;
+	node->outerNodeDone = false;
+	node->n_fullsort_remaining = 0;
+	node->bound_Done = 0;
+	node->presorted_keys = NULL;
+
+	node->execution_status = INCSORT_LOADFULLSORT;
+
+	/*
+	 * If we've already set up either of the sort states, we need to reset
+	 * them.  We could end them and null out the pointers, but there's no
+	 * reason to repay the setup cost, and because the pivot comparator state
+	 * is guarded by a similar check, nulling the pointers out might actually
+	 * cause a leak.
+	 */
+	if (node->fullsort_state != NULL)
+		tuplesort_reset(node->fullsort_state);
+	if (node->prefixsort_state != NULL)
+		tuplesort_reset(node->prefixsort_state);
+
+	/*
+	 * If chgParam of subnode is not null, then the plan will be re-scanned
+	 * by the first ExecProcNode.
+	 */
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *						Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortEstimate
+ *
+ *		Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+	size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeDSM
+ *
+ *		Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+	Size		size;
+
+	/* don't need this if not instrumenting or no workers */
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ pcxt->nworkers * sizeof(IncrementalSortInfo);
+	node->shared_info = shm_toc_allocate(pcxt->toc, size);
+	/* ensure any unfilled slots will contain zeroes */
+	memset(node->shared_info, 0, size);
+	node->shared_info->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+				   node->shared_info);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortInitializeWorker
+ *
+ *		Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+	node->shared_info =
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+	node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIncrementalSortRetrieveInstrumentation
+ *
+ *		Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+	Size		size;
+	SharedIncrementalSortInfo *si;
+
+	if (node->shared_info == NULL)
+		return;
+
+	size = offsetof(SharedIncrementalSortInfo, sinfo)
+		+ node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+	si = palloc(size);
+	memcpy(si, node->shared_info, size);
+	node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 5d1debc196..9d2bfd7ed6 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
-											  NULL, node->randomAccess);
+											  NULL,
+											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c9a90d1191..29da0a6fbb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -927,6 +927,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *		This function copies the fields of the Sort node.  It is used by
+ *		all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	COPY_SCALAR_FIELD(numCols);
+	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -938,13 +956,29 @@ _copySort(const Sort *from)
 	/*
 	 * copy node superclass fields
 	 */
-	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+	CopySortFields(from, newnode);
 
-	COPY_SCALAR_FIELD(numCols);
-	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-	COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+	return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+	IncrementalSort *newnode = makeNode(IncrementalSort);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopySortFields((const Sort *) from, (Sort *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(nPresortedCols);
 
 	return newnode;
 }
@@ -4896,6 +4930,9 @@ copyObjectImpl(const void *from)
 		case T_Sort:
 			retval = _copySort(from);
 			break;
+		case T_IncrementalSort:
+			retval = _copyIncrementalSort(from);
+			break;
 		case T_Group:
 			retval = _copyGroup(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..f1271b6aca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -837,10 +837,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-	WRITE_NODE_TYPE("SORT");
-
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_INT_FIELD(numCols);
@@ -850,6 +848,24 @@ _outSort(StringInfo str, const Sort *node)
 	WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+	WRITE_NODE_TYPE("SORT");
+
+	_outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+	WRITE_NODE_TYPE("INCREMENTALSORT");
+
+	_outSortInfo(str, (const Sort *) node);
+
+	WRITE_INT_FIELD(nPresortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3784,6 +3800,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Sort:
 				_outSort(str, obj);
 				break;
+			case T_IncrementalSort:
+				_outIncrementalSort(str, obj);
+				break;
 			case T_Unique:
 				_outUnique(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..2a2f39bf04 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2150,12 +2150,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *	Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-	READ_LOCALS(Sort);
+	READ_TEMP_LOCALS();
 
 	ReadCommonPlan(&local_node->plan);
 
@@ -2164,6 +2165,32 @@ _readSort(void)
 	READ_OID_ARRAY(sortOperators, local_node->numCols);
 	READ_OID_ARRAY(collations, local_node->numCols);
 	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS_NO_FIELDS(Sort);
+
+	ReadCommonSort(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+	READ_LOCALS(IncrementalSort);
+
+	ReadCommonSort(&local_node->sort);
+
+	READ_INT_FIELD(nPresortedCols);
 
 	READ_DONE();
 }
@@ -2801,6 +2828,8 @@ parseNodeString(void)
 		return_value = _readMaterial();
 	else if (MATCH("SORT", 4))
 		return_value = _readSort();
+	else if (MATCH("INCREMENTALSORT", 15))
+		return_value = _readIncrementalSort();
 	else if (MATCH("GROUP", 5))
 		return_value = _readGroup();
 	else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..ccf46dd0aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
 			ptype = "Sort";
 			subpath = ((SortPath *) path)->subpath;
 			break;
+		case T_IncrementalSortPath:
+			ptype = "IncrementalSort";
+			subpath = ((SortPath *) path)->subpath;
+			break;
 		case T_GroupPath:
 			ptype = "Group";
 			subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e7e57f118..0eef5d7707 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -128,6 +128,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_incrementalsort = true;
 bool		enable_hashagg = true;
 bool		enable_hashagg_disk = true;
 bool		enable_groupingsets_hash_disk = false;
@@ -1648,9 +1649,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *	  Determines and returns the cost of sorting a relation, including
- *	  the cost of reading the input data.
+ * cost_tuplesort
+ *	  Determines and returns the cost of sorting a relation using tuplesort,
+ *    not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1677,39 +1678,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
-		  double limit_tuples)
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+			   double tuples, int width,
+			   Cost comparison_cost, int sort_mem,
+			   double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
-	if (!enable_sort)
-		startup_cost += disable_cost;
-
-	path->rows = tuples;
-
 	/*
 	 * We want to be sure the cost of a sort is never estimated as zero, even
 	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1748,7 +1733,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 
 		/* Disk costs */
 
@@ -1759,7 +1744,7 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		*startup_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1770,12 +1755,12 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		*startup_cost = comparison_cost * tuples * LOG2(tuples);
 	}
 
 	/*
@@ -1786,8 +1771,143 @@ cost_sort(Path *path, PlannerInfo *root,
 	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
 	 * counting the LIMIT otherwise.
 	 */
-	run_cost += cpu_operator_cost * tuples;
+	*run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_incremental_sort
+ * 	Determines and returns the cost of sorting a relation incrementally, when
+ *  the input path is presorted by a prefix of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Cost input_startup_cost, Cost input_total_cost,
+					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+					  double limit_tuples)
+{
+	Cost		startup_cost = 0,
+				run_cost = 0,
+				input_run_cost = input_total_cost - input_startup_cost;
+	double		group_tuples,
+				input_groups;
+	Cost		group_startup_cost,
+				group_run_cost,
+				group_input_run_cost;
+	List	   *presortedExprs = NIL;
+	ListCell   *l;
+	int			i = 0;
+
+	Assert(presorted_keys != 0);
+
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Extract presorted keys as list of expressions */
+	foreach(l, pathkeys)
+	{
+		PathKey    *key = (PathKey *) lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+		linitial(key->pk_eclass->ec_members);
+
+		presortedExprs = lappend(presortedExprs, member->em_expr);
+
+		i++;
+		if (i >= presorted_keys)
+			break;
+	}
+
+	/* Estimate number of groups with equal presorted keys */
+	input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+	group_tuples = input_tuples / input_groups;
+	group_input_run_cost = input_run_cost / input_groups;
+
+	/*
+	 * Estimate the average cost of sorting one group where the presorted
+	 * keys are equal.  Incremental sort is sensitive to the distribution of
+	 * tuples among the groups, for which we're relying on quite rough
+	 * assumptions.  Thus, we're pessimistic about incremental sort
+	 * performance and inflate the average group size by 50%.
+	 */
+	cost_tuplesort(&group_startup_cost, &group_run_cost,
+				   1.5 * group_tuples, width, comparison_cost, sort_mem,
+				   limit_tuples);
+
+	/*
+	 * Startup cost of incremental sort is the startup cost of its first group
+	 * plus the cost of its input.
+	 */
+	startup_cost += group_startup_cost
+		+ input_startup_cost + group_input_run_cost;
+
+	/*
+	 * After we started producing tuples from the first group, the cost of
+	 * producing all the tuples is given by the cost to finish processing this
+	 * group, plus the total cost to process the remaining groups, plus the
+	 * remaining cost of input.
+	 */
+	run_cost += group_run_cost
+		+ (group_run_cost + group_startup_cost) * (input_groups - 1)
+		+ group_input_run_cost * (input_groups - 1);
+
+	/*
+	 * Incremental sort adds some overhead of its own.  First, it has to
+	 * detect the sort groups; this is roughly equal to one extra copy and
+	 * comparison per tuple.  Second, it has to reset the tuplesort context
+	 * for every group.
+	 */
+	run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+	run_cost += 2.0 * cpu_tuple_cost * input_groups;
 
+	path->rows = input_tuples;
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
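+
+/*
+ * Illustrative example of the model above (numbers are hypothetical):
+ * with input_tuples = 10000 and an estimated input_groups = 100, we get
+ * group_tuples = 100 and cost a single tuplesort of 1.5 * 100 = 150
+ * tuples.  The startup cost then covers sorting the first group plus the
+ * input's startup cost and 1/100th of its run cost; the run cost covers
+ * finishing the first group, the 99 remaining groups (including their
+ * share of the input's run cost), the per-tuple group-detection overhead
+ * (cpu_tuple_cost + comparison_cost per tuple), and 2 * cpu_tuple_cost
+ * per group for the tuplesort resets.
+ */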
+
+/*
+ * cost_sort
+ *	  Determines and returns the cost of sorting a relation, including
+ *	  the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+		  List *pathkeys, Cost input_cost, double tuples, int width,
+		  Cost comparison_cost, int sort_mem,
+		  double limit_tuples)
+
+{
+	Cost		startup_cost;
+	Cost		run_cost;
+
+	cost_tuplesort(&startup_cost, &run_cost,
+				   tuples, width,
+				   comparison_cost, sort_mem,
+				   limit_tuples);
+
+	if (!enable_sort)
+		startup_cost += disable_cost;
+
+	startup_cost += input_cost;
+
+	path->rows = tuples;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 71b9d42c99..21e3f5a987 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,60 @@ pathkeys_contained_in(List *keys1, List *keys2)
 	return false;
 }
 
+/*
+ * pathkeys_count_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common)
+{
+	int			n = 0;
+	ListCell   *key1,
+			   *key2;
+
+	/*
+	 * See if we can avoid looping through both lists. This optimization
+	 * gains us several percent in planning time in a worst-case test.
+	 */
+	if (keys1 == keys2)
+	{
+		*n_common = list_length(keys1);
+		return true;
+	}
+	else if (keys1 == NIL)
+	{
+		*n_common = 0;
+		return true;
+	}
+	else if (keys2 == NIL)
+	{
+		*n_common = 0;
+		return false;
+	}
+
+	/*
+	 * If both lists are non-empty, iterate through both to find out how many
+	 * items are shared.
+	 */
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+		{
+			*n_common = n;
+			return false;
+		}
+		n++;
+	}
+
+	/* If we reached the end of keys1, then all of it is contained in keys2. */
+	*n_common = n;
+	return (key1 == NULL);
+}
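+
+/*
+ * A quick illustration with hypothetical pathkey lists: given
+ * keys1 = (a, b, c) and keys2 = (a, b), we set *n_common = 2 and return
+ * false, since keys1 is not fully contained in keys2.  Given
+ * keys1 = (a, b) and keys2 = (a, b, c), we set *n_common = 2 and return
+ * true.
+ */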
+
 /*
  * get_cheapest_path_for_pathkeys
  *	  Find the cheapest path (according to the specified criterion) that
@@ -1786,26 +1840,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Because we have the possibility of incremental sort, a prefix of the
+ * requested ordering's keys is potentially useful for improving its
+ * performance.  Thus we return 0 if no useful keys are found, or else the
+ * number of leading keys shared by the list and the requested ordering.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int			n_common_pathkeys;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-	{
-		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
-	}
+	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
+										&n_common_pathkeys);
 
-	return 0;					/* path ordering not useful */
+	return n_common_pathkeys;
 }
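+
+/*
+ * For example (illustrative): with query_pathkeys (a, b, c) and a path
+ * sorted by (a) alone, we now return 1 rather than 0, letting the planner
+ * consider an incremental sort over (b, c) on top of the presorted path.
+ */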
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..6d26bfbeb5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
 									int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+													IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
 										int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
 					   AttrNumber *sortColIdx, Oid *sortOperators,
 					   Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+											 int numCols, int nPresortedCols,
+											 AttrNumber *sortColIdx, Oid *sortOperators,
+											 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 										Relids relids,
 										const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
 												 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 									 Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+														   List *pathkeys, Relids relids, int nPresortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
 									  AttrNumber *grpColIdx,
 									  Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
 											 (SortPath *) best_path,
 											 flags);
 			break;
+		case T_IncrementalSort:
+			plan = (Plan *) create_incrementalsort_plan(root,
+														(IncrementalSortPath *) best_path,
+														flags);
+			break;
 		case T_Group:
 			plan = (Plan *) create_group_plan(root,
 											  (GroupPath *) best_path);
@@ -1994,6 +2007,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 	return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort *plan;
+	Plan	   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+											  best_path->spath.path.pathkeys,
+											  IS_OTHER_REL(best_path->spath.subpath->parent) ?
+											  best_path->spath.path.parent->relids : NULL,
+											  best_path->nPresortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5090,6 +5129,12 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	Plan	   *lefttree = plan->plan.lefttree;
 	Path		sort_path;		/* dummy for result of cost_sort */
 
+	/*
+	 * This function shouldn't have to deal with IncrementalSort plans because
+	 * they are only created from corresponding Path nodes.
+	 */
+	Assert(IsA(plan, Sort));
+
 	cost_sort(&sort_path, root, NIL,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
@@ -5677,9 +5722,12 @@ make_sort(Plan *lefttree, int numCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst)
 {
-	Sort	   *node = makeNode(Sort);
-	Plan	   *plan = &node->plan;
+	Sort	   *node;
+	Plan	   *plan;
 
+	node = makeNode(Sort);
+
+	plan = &node->plan;
 	plan->targetlist = lefttree->targetlist;
 	plan->qual = NIL;
 	plan->lefttree = lefttree;
@@ -5693,6 +5741,37 @@ make_sort(Plan *lefttree, int numCols,
 	return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int nPresortedCols,
+					 AttrNumber *sortColIdx, Oid *sortOperators,
+					 Oid *collations, bool *nullsFirst)
+{
+	IncrementalSort *node;
+	Plan	   *plan;
+
+	node = makeNode(IncrementalSort);
+
+	plan = &node->sort.plan;
+	plan->targetlist = lefttree->targetlist;
+	plan->qual = NIL;
+	plan->lefttree = lefttree;
+	plan->righttree = NULL;
+	node->nPresortedCols = nPresortedCols;
+	node->sort.numCols = numCols;
+	node->sort.sortColIdx = sortColIdx;
+	node->sort.sortOperators = sortOperators;
+	node->sort.collations = collations;
+	node->sort.nullsFirst = nullsFirst;
+
+	return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -6039,6 +6118,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
 					 collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'nPresortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int nPresortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the IncrementalSort node */
+	return make_incrementalsort(lefttree, numsortkeys, nPresortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *	  Create sort plan to sort according to given sortclauses
@@ -6774,6 +6889,7 @@ is_projection_capable_path(Path *path)
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f52226ccec..aeb83841d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4924,13 +4924,16 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full sort
+ * and incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
  * limit_tuples: estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate
+ *
+ * XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
+ * other pathkeys (grouping, ...) like generate_useful_gather_paths.
  */
 static RelOptInfo *
 create_ordered_paths(PlannerInfo *root,
@@ -4964,29 +4967,77 @@ create_ordered_paths(PlannerInfo *root,
 
 	foreach(lc, input_rel->pathlist)
 	{
-		Path	   *path = (Path *) lfirst(lc);
+		Path	   *input_path = (Path *) lfirst(lc);
+		Path	   *sorted_path = input_path;
 		bool		is_sorted;
+		int			presorted_keys;
+
+		is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys, &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			/*
+			 * Try adding an explicit sort, but only to the cheapest total path
+			 * since a full sort should generally add the same cost to all
+			 * paths.
+			 */
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
 
+			/*
+			 * If incremental sort is enabled, then try it as well. Unlike with
+			 * regular sorts, we can't just look at the cheapest path, because
+			 * the cost of incremental sort depends on how well presorted the
+			 * path is. Additionally, incremental sort may enable a cheaper
+			 * startup path to win out despite higher total cost.
+			 */
+			if (!enable_incrementalsort)
+				continue;
+
+			/* Likewise, if the path can't be used for incremental sort. */
+			if (!presorted_keys)
+				continue;
+
+			/* Also consider incremental sort. */
+			sorted_path = (Path *) create_incremental_sort_path(root,
+																ordered_rel,
+																input_path,
+																root->sort_pathkeys,
+																presorted_keys,
+																limit_tuples);
+
 			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
 
-			add_path(ordered_rel, path);
+			add_path(ordered_rel, sorted_path);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..2b676bf406 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -678,6 +678,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3650e8329d..b02fcb9bfe 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2688,6 +2688,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
 		case T_Hash:
 		case T_Material:
 		case T_Sort:
+		case T_IncrementalSort:
 		case T_Unique:
 		case T_SetOp:
 		case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4e798b801a..eebf167b0a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -776,12 +776,11 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
 
 		/*
-		 * Unless pathkeys are incompatible, see if one of the paths dominates
-		 * the other (both in startup and total cost). It may happen that one
-		 * path has lower startup cost, the other has lower total cost.
-		 *
-		 * XXX Perhaps we could do this only when incremental sort is enabled,
-		 * and use the simpler version (comparing just total cost) otherwise?
+		 * When one path dominates the other (in both startup and total
+		 * cost), see if the pathkeys are compatible.  It may happen that one
+		 * path has lower startup cost while the other has lower total cost.
 		 */
 		if (keyscmp != PATHKEYS_DIFFERENT)
 		{
@@ -2754,6 +2753,57 @@ create_set_projection_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *	  Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *		already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *		or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+							 RelOptInfo *rel,
+							 Path *subpath,
+							 List *pathkeys,
+							 int presorted_keys,
+							 double limit_tuples)
+{
+	IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+	SortPath   *pathnode = &sort->spath;
+
+	pathnode->path.pathtype = T_IncrementalSort;
+	pathnode->path.parent = rel;
+	/* Sort doesn't project, so use source path's pathtarget */
+	pathnode->path.pathtarget = subpath->pathtarget;
+	/* For now, assume we are above any joins, so no parameterization */
+	pathnode->path.param_info = NULL;
+	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel &&
+		subpath->parallel_safe;
+	pathnode->path.parallel_workers = subpath->parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
+
+	pathnode->subpath = subpath;
+
+	cost_incremental_sort(&pathnode->path,
+						  root, pathkeys, presorted_keys,
+						  subpath->startup_cost,
+						  subpath->total_cost,
+						  subpath->rows,
+						  subpath->pathtarget->width,
+						  0.0,	/* XXX comparison_cost shouldn't be 0? */
+						  work_mem, limit_tuples);
+
+	sort->nPresortedCols = presorted_keys;
+
+	return pathnode;
+}
+
 /*
  * create_sort_path
  *	  Creates a pathnode that represents performing an explicit sort.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 64dc9fbd13..4b91222cd9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -990,6 +990,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+			NULL
+		},
+		&enable_incrementalsort,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904fa7300..9f77328349 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -359,6 +359,7 @@
 #enable_parallel_append = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_incrementalsort = on
 #enable_tidscan = on
 #enable_partitionwise_join = off
 #enable_partitionwise_aggregate = off
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..cc33a85731 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,16 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of the memtuples array.  We're trying to select this size so
+ * that the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
+ * overhead of allocation is kept low.  However, we don't consider array
+ * sizes smaller than 1024.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
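+
+/*
+ * For instance, assuming an 8kB ALLOCSET_SEPARATE_THRESHOLD and a
+ * SortTuple of a few dozen bytes (typical on 64-bit builds), the
+ * threshold term comes to a few hundred entries, so the macro resolves
+ * to 1024.
+ */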
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool		trace_sort = false;
@@ -241,6 +251,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among
+								 * sorts of groups, either in-memory or
+								 * on-disk */
+	bool		isMaxSpaceDisk; /* true when maxSpace is a value for on-disk
+								 * space, false when it's for in-memory space */
+	TupSortStatus	maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext	maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -591,6 +609,7 @@ struct Sharedsort
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  bool randomAccess);
+static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state, bool mergeruns);
@@ -647,6 +666,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,8 +703,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 					   bool randomAccess)
 {
 	Tuplesortstate *state;
+	MemoryContext maincontext;
 	MemoryContext sortcontext;
-	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/* See leader_takeover_tapes() remarks on randomAccess support */
@@ -691,31 +712,31 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Caller tuple (e.g. IndexTuple) memory context.
-	 *
-	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * Additionally, a working memory context for tuples is set up in
+	 * tuplesort_begin_batch.
 	 */
-	tuplecontext = AllocSetContextCreate(sortcontext,
-										 "Caller tuples",
-										 ALLOCSET_DEFAULT_SIZES);
 
 	/*
-	 * Make the Tuplesortstate within the per-sort context.  This way, we
+	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
-	oldcontext = MemoryContextSwitchTo(sortcontext);
+	oldcontext = MemoryContextSwitchTo(maincontext);
 
 	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -724,11 +745,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
-	state->bounded = false;
 	state->tuples = true;
-	state->boundUsed = false;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -737,38 +755,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	 * with very little memory.
 	 */
 	state->allowedMem = Max(workMem, 64) * (int64) 1024;
-	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
-	state->tuplecontext = tuplecontext;
-	state->tapeset = NULL;
-
-	state->memtupcount = 0;
+	state->maincontext = maincontext;
 
 	/*
 	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
 	 * see comments in grow_memtuples().
 	 */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
-	state->growmemtuples = true;
-	state->slabAllocatorUsed = false;
-	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
-
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/* workMem must be large enough for the minimal memtuples array */
-	if (LACKMEM(state))
-		elog(ERROR, "insufficient memory allowed for sort");
-
-	state->currentRun = 0;
+	state->memtupsize = INITIAL_MEMTUPSIZE;
+	state->memtuples = NULL;
 
 	/*
-	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
-	 * inittapes(), if needed
+	 * After all of the other non-parallel-related state, we set up all of
+	 * the state needed for each batch.
 	 */
-
-	state->result_tape = -1;	/* flag that result tape has not been formed */
+	tuplesort_begin_batch(state);
 
 	/*
 	 * Initialize parallel-related state based on coordination information
@@ -802,6 +803,77 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 	return state;
 }
 
+/*
+ *		tuplesort_begin_batch
+ *
+ * Set up, or reset, all state needed for processing a new set of tuples with
+ * this sort state.  Called both from tuplesort_begin_common (the first time
+ * sorting with this sort state) and tuplesort_reset (for subsequent usages).
+ */
+static void
+tuplesort_begin_batch(Tuplesortstate *state)
+{
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
+
+	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller passed tuples
+	 * eases memory management.  Resetting at key points reduces
+	 * fragmentation. Note that the memtuples array of SortTuples is allocated
+	 * in the parent context, not this context, because there is no need to
+	 * free memtuples early.
+	 */
+	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+												"Caller tuples",
+												ALLOCSET_DEFAULT_SIZES);
+
+	state->status = TSS_INITIAL;
+	state->bounded = false;
+	state->boundUsed = false;
+
+	state->availMem = state->allowedMem;
+
+	state->tapeset = NULL;
+
+	state->memtupcount = 0;
+
+	/*
+	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
+	 * see comments in grow_memtuples().
+	 */
+	state->growmemtuples = true;
+	state->slabAllocatorUsed = false;
+	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
+	{
+		pfree(state->memtuples);
+		state->memtuples = NULL;
+		state->memtupsize = INITIAL_MEMTUPSIZE;
+	}
+	if (state->memtuples == NULL)
+	{
+		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	}
+
+	/* workMem must be large enough for the minimal memtuples array */
+	if (LACKMEM(state))
+		elog(ERROR, "insufficient memory allowed for sort");
+
+	state->currentRun = 0;
+
+	/*
+	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
+	 * inittapes(), if needed
+	 */
+
+	state->result_tape = -1;	/* flag that result tape has not been formed */
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -814,7 +886,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 	AssertArg(nkeys > 0);
 
@@ -890,7 +962,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -985,7 +1057,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	MemoryContext oldcontext;
 	int			i;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1063,7 +1135,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
 	MemoryContext oldcontext;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1106,7 +1178,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	int16		typlen;
 	bool		typbyval;
 
-	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -1224,16 +1296,23 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 }
 
 /*
- * tuplesort_end
+ * tuplesort_used_bound
  *
- *	Release resources and clean up.
+ * Allow callers to find out if the sort state was able to use a bound.
+ */
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+	return state->boundUsed;
+}
+
+/*
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *	Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
 	/* context swap probably not needed, but let's be safe */
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1291,10 +1370,104 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Free the per-sort memory context, thereby releasing all working memory,
-	 * including the Tuplesortstate struct itself.
+	 * Free the per-sort memory context, thereby releasing all working memory.
 	 */
-	MemoryContextDelete(state->sortcontext);
+	MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *	Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+	tuplesort_free(state);
+
+	/*
+	 * Free the main memory context, including the Tuplesortstate struct
+	 * itself.
+	 */
+	MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *	Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+	int64		spaceUsed;
+	bool		isSpaceDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+	if (state->tapeset)
+	{
+		isSpaceDisk = true;
+		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+	}
+	else
+	{
+		isSpaceDisk = false;
+		spaceUsed = state->allowedMem - state->availMem;
+	}
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we consider space used on disk to be more
+	 * important for tracking resource usage than space used in memory.  Note
+	 * that the amount of space occupied by a set of tuples on disk might be
+	 * less than the amount occupied by the same tuples in memory, due to a
+	 * more compact representation.
+	 */
+	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
+		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
+	{
+		state->maxSpace = spaceUsed;
+		state->isMaxSpaceDisk = isSpaceDisk;
+		state->maxSpaceStatus = state->status;
+	}
+}
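+
+/*
+ * For instance (illustrative): if one batch sorted entirely in memory
+ * using 5MB while a later batch spilled 3MB to disk, we keep the 3MB
+ * on-disk figure, since disk usage is treated as the more significant
+ * resource.
+ */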
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
+ *	a new sort.  This allows avoiding recreation of tuple sort states (and
+ *	save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	tuplesort_updatemax(state);
+	tuplesort_free(state);
+
+	/*
+	 * After we've freed up per-batch memory, reinitialize all of the state
+	 * common to both the first batch and any subsequent batch.
+	 */
+	tuplesort_begin_batch(state);
+
+	state->lastReturnedTuple = NULL;
+	state->slabMemoryBegin = NULL;
+	state->slabMemoryEnd = NULL;
+	state->slabFreeHead = NULL;
 }
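+
+/*
+ * A rough sketch of the intended multi-batch lifecycle (illustrative
+ * only; error handling omitted):
+ *
+ *		state = tuplesort_begin_heap(...);
+ *		for each batch:
+ *			while (batch has more tuples)
+ *				tuplesort_puttupleslot(state, slot);
+ *			tuplesort_performsort(state);
+ *			while (tuplesort_gettupleslot(state, ...))
+ *				... emit sorted tuple ...
+ *			tuplesort_reset(state);		-- ready for the next batch
+ *		tuplesort_end(state);
+ */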
 
 /*
@@ -2591,8 +2764,7 @@ mergeruns(Tuplesortstate *state)
 	 * Reset tuple memory.  We've freed all the tuples that we previously
 	 * allocated.  We will use the slab allocator from now on.
 	 */
-	MemoryContextDelete(state->tuplecontext);
-	state->tuplecontext = NULL;
+	MemoryContextResetOnly(state->tuplecontext);
 
 	/*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2814,8 @@ mergeruns(Tuplesortstate *state)
 	 * from each input tape.
 	 */
 	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
 	/*
@@ -3138,18 +3311,15 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
 	 */
-	if (state->tapeset)
-	{
+	tuplesort_updatemax(state);
+
+	if (state->isMaxSpaceDisk)
 		stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-	}
 	else
-	{
 		stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-	}
+	stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-	switch (state->status)
+	switch (state->maxSpaceStatus)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
index 2e9920111f..4af6e0013d 100644
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)				nodeDisplay(l)
 #define SO_printf(s)					printf(s)
 #define SO1_printf(s, p)				printf(s, p)
+#define SO2_printf(s, p1, p2)			printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif							/* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..e62c02a4f3
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif							/* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0fb5d61a3f..fb490b404c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1982,6 +1982,21 @@ typedef struct MaterialState
 	Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, it's possible that the input dataset is
+ *	 already sorted on a prefix of those keys.  We call these
+ *	 "presorted keys".
+ *	 PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo	flinfo;			/* comparison function info */
+	FunctionCallInfo fcinfo;	/* comparison function call info */
+	OffsetNumber attno;			/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *	 Shared memory container for per-worker sort information
  * ----------------
@@ -2010,6 +2025,71 @@ typedef struct SortState
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
+/* ----------------
+ *	 Instrumentation information for IncrementalSort
+ * ----------------
+ */
+typedef struct IncrementalSortGroupInfo
+{
+	int64		groupCount;
+	long		maxDiskSpaceUsed;
+	long		totalDiskSpaceUsed;
+	long		maxMemorySpaceUsed;
+	long		totalMemorySpaceUsed;
+	bits32		sortMethods; /* bitmask of TuplesortMethod */
+} IncrementalSortGroupInfo;
+
+typedef struct IncrementalSortInfo
+{
+	IncrementalSortGroupInfo fullsortGroupInfo;
+	IncrementalSortGroupInfo prefixsortGroupInfo;
+} IncrementalSortInfo;
+
+/* ----------------
+ *	 Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct SharedIncrementalSortInfo
+{
+	int			num_workers;
+	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *	 IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+	INCSORT_LOADFULLSORT,
+	INCSORT_LOADPREFIXSORT,
+	INCSORT_READFULLSORT,
+	INCSORT_READPREFIXSORT,
+} IncrementalSortExecutionStatus;
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		outerNodeDone;	/* finished fetching tuples from outer node */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+
+	IncrementalSortInfo incsort_info;
+
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *	GroupState information
  * ---------------------
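
The execution_status field above drives a small state machine alternating between the two tuplesort instances. A simplified sketch of the control flow; load_batch(), load_group(), have_tuple(), and next_tuple() are hypothetical helpers, and the real node's transitions also depend on group sizes and n_fullsort_remaining:

    /* simplified; bounds and error handling omitted */
    for (;;)
    {
        switch (node->execution_status)
        {
            case INCSORT_LOADFULLSORT:
                /* sort a small initial batch on ALL sort keys */
                load_batch(node->fullsort_state);
                node->execution_status = INCSORT_READFULLSORT;
                break;

            case INCSORT_READFULLSORT:
                if (have_tuple(node->fullsort_state))
                    return next_tuple(node->fullsort_state);
                /* batch drained: switch to cheaper per-group sorting */
                node->execution_status = INCSORT_LOADPREFIXSORT;
                break;

            case INCSORT_LOADPREFIXSORT:
                /*
                 * Collect one group whose presorted-key values match
                 * group_pivot, sorting only the remaining keys.
                 */
                load_group(node->prefixsort_state);
                node->execution_status = INCSORT_READPREFIXSORT;
                break;

            case INCSORT_READPREFIXSORT:
                if (have_tuple(node->prefixsort_state))
                    return next_tuple(node->prefixsort_state);
                node->execution_status = INCSORT_LOADPREFIXSORT;
                break;
        }
    }
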
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..50b1ba5186 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
 	T_HashJoin,
 	T_Material,
 	T_Sort,
+	T_IncrementalSort,
 	T_Group,
 	T_Agg,
 	T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
 	T_HashJoinState,
 	T_MaterialState,
 	T_SortState,
+	T_IncrementalSortState,
 	T_GroupState,
 	T_AggState,
 	T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
 	T_ProjectionPath,
 	T_ProjectSetPath,
 	T_SortPath,
+	T_IncrementalSortPath,
 	T_GroupPath,
 	T_UpperUniquePath,
 	T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5334a73b53..bb2cb70709 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1621,6 +1621,15 @@ typedef struct SortPath
 	Path	   *subpath;		/* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+	SortPath	spath;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..be8ef54a1e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -774,6 +774,16 @@ typedef struct Sort
 	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Sort;
 
+/* ----------------
+ *		incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+	Sort		sort;
+	int			nPresortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *	 group node -
  *		Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 735ba09650..9710e5c0a4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_hashagg_disk;
 extern PGDLLIMPORT bool enable_groupingsets_hash_disk;
@@ -103,6 +104,11 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
 					  Cost comparison_cost, int sort_mem,
 					  double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Cost input_startup_cost, Cost input_total_cost,
+								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
+								  double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e73c5637cc..b145a8da97 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -185,6 +185,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  Path *subpath,
+											  List *pathkeys,
+											  int presorted_keys,
+											  double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
 								  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..ed50092bc7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,7 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_count_contained_in(List *keys1, List *keys2, int *n_common);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index a2fdd3fcd3..8d00a9e501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -61,14 +61,17 @@ typedef struct SortCoordinateData *SortCoordinate;
  * Data structures for reporting sort statistics.  Note that
  * TuplesortInstrumentation can't contain any pointers because we
  * sometimes put it in shared memory.
+ *
+ * TuplesortMethod is used as a bitmask in Incremental Sort's shared memory
+ * instrumentation, so each value needs to be a separate bit.
  */
 typedef enum
 {
-	SORT_TYPE_STILL_IN_PROGRESS = 0,
-	SORT_TYPE_TOP_N_HEAPSORT,
-	SORT_TYPE_QUICKSORT,
-	SORT_TYPE_EXTERNAL_SORT,
-	SORT_TYPE_EXTERNAL_MERGE
+	SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
+	SORT_TYPE_TOP_N_HEAPSORT = 1 << 1,
+	SORT_TYPE_QUICKSORT = 1 << 2,
+	SORT_TYPE_EXTERNAL_SORT = 1 << 3,
+	SORT_TYPE_EXTERNAL_MERGE = 1 << 4
 } TuplesortMethod;
 
 typedef enum
@@ -215,6 +218,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +243,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
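
Because the TuplesortMethod values are now distinct bits, per-group instrumentation can OR the method used by each group into a single bits32 and enumerate the set later, which is what produces output like "Sort Methods: top-N heapsort, quicksort" in the regression tests below. A brief sketch of the accumulation (the StringInfo handling is illustrative, not taken from the patch):

    bits32		methods = 0;
    TuplesortInstrumentation stats;

    /* after each group's sort completes */
    tuplesort_get_stats(state, &stats);
    methods |= stats.sortMethod;

    /* when reporting, list every method that occurred */
    if (methods & SORT_TYPE_TOP_N_HEAPSORT)
        appendStringInfoString(&buf, "top-N heapsort");
    if (methods & SORT_TYPE_QUICKSORT)
        appendStringInfoString(&buf, "quicksort");
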
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..288a5b2101
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1399 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+create table t(a integer, b integer);
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+                                 explain_analyze_without_memory                                 
+------------------------------------------------------------------------------------------------
+ Limit (actual rows=55 loops=1)
+   ->  Incremental Sort (actual rows=55 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 55,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 2,                   +
+             "Sort Methods Used": [              +
+                 "top-N heapsort",               +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+                   QUERY PLAN                   
+------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t_1.a = t.a)
+   ->  Seq Scan on t
+         Filter: (a = ANY ('{1,2}'::integer[]))
+   ->  Incremental Sort
+         Sort Key: t_1.a, t_1.b
+         Presorted Key: t_1.a
+         ->  Sort
+               Sort Key: t_1.a
+               ->  Seq Scan on t t_1
+(10 rows)
+
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+ a | b | a | b 
+---+---+---+---
+ 1 | 1 | 1 | 1
+ 2 | 2 | 2 | 2
+(2 rows)
+
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+                                                           explain_analyze_without_memory                                                            
+-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Limit (actual rows=70 loops=1)
+   ->  Incremental Sort (actual rows=70 loops=1)
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=100 loops=1)
+               Sort Key: t.a
+               Sort Method: quicksort  Memory: NNkB
+               ->  Seq Scan on t (actual rows=100 loops=1)
+(9 rows)
+
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+                   jsonb_pretty                   
+--------------------------------------------------
+ [                                               +
+     {                                           +
+         "Sort Key": [                           +
+             "t.a",                              +
+             "t.b"                               +
+         ],                                      +
+         "Node Type": "Incremental Sort",        +
+         "Actual Rows": 70,                      +
+         "Actual Loops": 1,                      +
+         "Presorted Key": [                      +
+             "t.a"                               +
+         ],                                      +
+         "Parallel Aware": false,                +
+         "Full-sort Groups": {                   +
+             "Group Count": 1,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Presorted Groups": {                   +
+             "Group Count": 5,                   +
+             "Sort Methods Used": [              +
+                 "quicksort"                     +
+             ],                                  +
+             "Sort Space Memory": {              +
+                 "Average Sort Space Used": "NN",+
+                 "Maximum Sort Space Used": "NN" +
+             }                                   +
+         },                                      +
+         "Parent Relationship": "Outer"          +
+     }                                           +
+ ]
+(1 row)
+
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+ explain_analyze_inc_sort_nodes_verify_invariants 
+--------------------------------------------------
+ t
+(1 row)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index a4dc12b5d6..06999785f4 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -11,6 +11,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 715842b87a..a126f0ad61 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -78,6 +78,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_hashagg                 | on
  enable_hashagg_disk            | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -91,7 +92,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(19 rows)
+(20 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index a98dba7b2f..a741e89616 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 3f66e0b859..1a6821ca46 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -89,6 +89,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b990b3b3de
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,194 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+create table t(a integer, b integer);
+
+create or replace function explain_analyze_without_memory(query text)
+returns table (out_line text) language plpgsql
+as
+$$
+declare
+  line text;
+begin
+  for line in
+    execute 'explain (analyze, costs off, summary off, timing off) ' || query
+  loop
+    out_line := regexp_replace(line, '\d+kB', 'NNkB', 'g');
+    return next;
+  end loop;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+  element jsonb;
+  matching_nodes jsonb := '[]'::jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  while jsonb_array_length(elements) > 0 loop
+    element := elements->0;
+    elements := elements - 0;
+    case jsonb_typeof(element)
+    when 'array' then
+      if jsonb_array_length(element) > 0 then
+        elements := elements || element;
+      end if;
+    when 'object' then
+      if element ? 'Plan' then
+        elements := elements || jsonb_build_array(element->'Plan');
+        element := element - 'Plan';
+      else
+        if element ? 'Plans' then
+          elements := elements || jsonb_build_array(element->'Plans');
+          element := element - 'Plans';
+        end if;
+        if (element->>'Node Type')::text = 'Incremental Sort' then
+          matching_nodes := matching_nodes || element;
+        end if;
+      end if;
+    end case;
+  end loop;
+  return matching_nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_without_memory(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  nodes jsonb := '[]'::jsonb;
+  node jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
+        node := jsonb_set(node, array[group_key, space_key, 'Maximum Sort Space Used'], '"NN"', false);
+      end loop;
+    end loop;
+    nodes := nodes || node;
+  end loop;
+  return nodes;
+end;
+$$;
+
+create or replace function explain_analyze_inc_sort_nodes_verify_invariants(query text)
+returns bool language plpgsql
+as
+$$
+declare
+  node jsonb;
+  group_stats jsonb;
+  group_key text;
+  space_key text;
+begin
+  for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+      group_stats := node->group_key;
+      for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
+        if (group_stats->space_key->'Maximum Sort Space Used')::bigint < (group_stats->space_key->'Average Sort Space Used')::bigint then
+          raise exception '% has invalid max space < average space', group_key;
+        end if;
+      end loop;
+    end loop;
+  end loop;
+  return true;
+end;
+$$;
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+-- Test EXPLAIN ANALYZE with only a fullsort group.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 55');
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+-- Test rescan.
+begin;
+-- We force the planner to choose a plan with incremental sort on the right side
+-- of a nested loop join node. That way we trigger the rescan code path.
+set local enable_hashjoin = off;
+set local enable_mergejoin = off;
+set local enable_material = off;
+set local enable_sort = off;
+explain (costs off) select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+select * from t left join (select * from (select * from t order by a) v order by a, b) s on s.a = t.a where t.a in (1, 2);
+rollback;
+-- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
+select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
+select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
+select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select * from t order by a) s order by a, b limit 70');
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index 946197fafc..05baef1106 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -12,6 +12,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to the fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.1
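
For anyone trying the patch interactively, the new GUC and EXPLAIN output are easy to exercise; a session along these lines mirrors the regression tests above (the plan shape is what matters, exact output may differ):

    set enable_incrementalsort = on;   -- the default
    explain (costs off)
      select * from (select * from tenk1 order by four) t
      order by four, ten limit 1;
    --  Limit
    --    ->  Incremental Sort
    --          Sort Key: tenk1.four, tenk1.ten
    --          Presorted Key: tenk1.four
    --          ->  Sort
    --                Sort Key: tenk1.four
    --                ->  Seq Scan on tenk1

    set enable_incrementalsort = off;  -- planner falls back to a plain Sort
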

v54-0005-Consider-incremental-sort-paths-in-additional-places.patch
From 224821ca92bd0efe0239d40319649fb9fa1b76a3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 28 Jul 2019 15:59:05 +0200
Subject: [PATCH 5/5] Consider incremental sort paths in additional places

---
 contrib/postgres_fdw/postgres_fdw.c     |  29 --
 src/backend/optimizer/geqo/geqo_eval.c  |   2 +-
 src/backend/optimizer/path/allpaths.c   | 217 +++++++++++++-
 src/backend/optimizer/path/equivclass.c |  28 ++
 src/backend/optimizer/plan/planner.c    | 373 +++++++++++++++++++++++-
 src/include/optimizer/paths.h           |   3 +
 6 files changed, 612 insertions(+), 40 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..9fc53cad68 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6523,35 +6523,6 @@ conversion_error_callback(void *arg)
 	}
 }
 
-/*
- * Find an equivalence class member expression, all of whose Vars, come from
- * the indicated relation.
- */
-Expr *
-find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
-{
-	ListCell   *lc_em;
-
-	foreach(lc_em, ec->ec_members)
-	{
-		EquivalenceMember *em = lfirst(lc_em);
-
-		if (bms_is_subset(em->em_relids, rel->relids) &&
-			!bms_is_empty(em->em_relids))
-		{
-			/*
-			 * If there is more than one equivalence member whose Vars are
-			 * taken entirely from this relation, we'll be content to choose
-			 * any one of those.
-			 */
-			return em->em_expr;
-		}
-	}
-
-	/* We didn't find any suitable equivalence class expression */
-	return NULL;
-}
-
 /*
  * Find an equivalence class member expression to be computed as a sort column
  * in the given target.
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6d897936d7..ff33acc7b6 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -274,7 +274,7 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, int num_gene,
 				 * grouping_planner).
 				 */
 				if (old_clump->size + new_clump->size < num_gene)
-					generate_gather_paths(root, joinrel, false);
+					generate_useful_gather_paths(root, joinrel, false);
 
 				/* Find and save the cheapest paths for this joinrel */
 				set_cheapest(joinrel);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ccf46dd0aa..255f56b827 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -556,7 +556,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		bms_membership(root->all_baserels) != BMS_SINGLETON)
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/* Now find the cheapest of the paths for this rel */
 	set_cheapest(rel);
@@ -2727,6 +2727,219 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
 	}
 }
 
+/*
+ * get_useful_pathkeys_for_relation
+ *		Determine which orderings of a relation might be useful.
+ *
+ * Getting data in sorted order can be useful either because the requested
+ * order matches the final output ordering for the overall query we're
+ * planning, or because it enables an efficient merge join.  Here, we try
+ * to figure out which pathkeys to consider.
+ *
+ * This allows us to do an incremental sort on top of an index scan under a
+ * gather merge node, i.e. parallelized.
+ *
+ * XXX At the moment this can only ever return a list with a single element,
+ * because it looks at query_pathkeys only. So we might return the pathkeys
+ * directly, but it seems plausible we'll want to consider other orderings
+ * in the future. For example, we might want to consider pathkeys useful for
+ * merge joins.
+ */
+static List *
+get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
+{
+	List	   *useful_pathkeys_list = NIL;
+
+	/*
+	 * Considering query_pathkeys is always worth it, because it might allow us
+	 * to avoid a total sort when we have a partially presorted path available.
+	 */
+	if (root->query_pathkeys)
+	{
+		ListCell   *lc;
+		int		npathkeys = 0;	/* useful pathkeys */
+
+		foreach(lc, root->query_pathkeys)
+		{
+			PathKey    *pathkey = (PathKey *) lfirst(lc);
+			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
+
+			/*
+			 * We can only build an Incremental Sort for pathkeys which contain
+			 * an EC member in the current relation, so ignore any suffix of the
+			 * list as soon as we find a pathkey without an EC member in the
+			 * relation.
+			 *
+			 * By still returning the prefix of the pathkeys list that does meet
+			 * criteria of EC membership in the current relation, we enable not
+			 * just an incremental sort on the entirety of query_pathkeys but
+			 * also incremental sort below a JOIN.
+			 */
+			if (!find_em_expr_for_rel(pathkey_ec, rel))
+				break;
+
+			npathkeys++;
+		}
+
+		/*
+		 * If the whole query_pathkeys list matches, append it directly, which
+		 * allows pathkeys to be compared cheaply by comparing list pointers.
+		 * If we have to truncate the pathkeys, we must copy the list first.
+		 */
+		if (npathkeys == list_length(root->query_pathkeys))
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   root->query_pathkeys);
+		else if (npathkeys > 0)
+			useful_pathkeys_list = lappend(useful_pathkeys_list,
+										   list_truncate(list_copy(root->query_pathkeys),
+														 npathkeys));
+	}
+
+	return useful_pathkeys_list;
+}
+
+/*
+ * generate_useful_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather or
+ *		Gather Merge on top of a partial path.
+ *
+ * Unlike plain generate_gather_paths, this not only looks at the pathkeys of
+ * input paths (aiming to preserve the ordering), but also considers orderings
+ * that might be useful to nodes above the gather merge node, and tries to add
+ * a sort (regular or incremental) to provide that.
+ */
+void
+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+{
+	ListCell   *lc;
+	double		rows;
+	double	   *rowsp = NULL;
+	List	   *useful_pathkeys_list = NIL;
+	Path	   *cheapest_partial_path = NULL;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/* Should we override the rel's rowcount estimate? */
+	if (override_rows)
+		rowsp = &rows;
+
+	/* generate the regular gather (merge) paths */
+	generate_gather_paths(root, rel, override_rows);
+
+	/* consider incremental sort for interesting orderings */
+	useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
+
+	/* used for explicit (full) sort paths */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+
+	/*
+	 * Consider incremental sort paths for each interesting ordering.
+	 */
+	foreach(lc, useful_pathkeys_list)
+	{
+		List	   *useful_pathkeys = lfirst(lc);
+		ListCell   *lc2;
+		bool		is_sorted;
+		int			presorted_keys;
+
+		foreach(lc2, rel->partial_pathlist)
+		{
+			Path	   *subpath = (Path *) lfirst(lc2);
+			GatherMergePath *path;
+
+			/*
+			 * If the path has no ordering at all, then we can neither use an
+			 * incremental sort nor rely on implicit sorting with a gather merge.
+			 */
+			if (subpath->pathkeys == NIL)
+				continue;
+
+			is_sorted = pathkeys_count_contained_in(useful_pathkeys,
+													 subpath->pathkeys,
+													 &presorted_keys);
+
+			/*
+			 * We don't need to consider the case where a subpath is already
+			 * fully sorted because generate_gather_paths already creates a
+			 * gather merge path for every subpath that has pathkeys present.
+			 *
+			 * But since the subpath is already sorted, we know we don't need
+			 * to consider adding a sort (of either kind) on top of it, so
+			 * we can continue here.
+			 */
+			if (is_sorted)
+				continue;
+
+			/*
+			 * Consider regular sort for the cheapest partial path (for each
+			 * useful pathkeys). We know the path is not sorted, because we'd
+			 * not get here otherwise.
+			 *
+			 * This is not redundant with the gather paths created in
+			 * generate_gather_paths, because that only builds on orderings the
+			 * subpaths already provide - it never adds a sort. Here we add an
+			 * explicit sort to produce the useful ordering.
+			 */
+			if (cheapest_partial_path == subpath)
+			{
+				Path	   *tmp;
+
+				tmp = (Path *) create_sort_path(root,
+												rel,
+												subpath,
+												useful_pathkeys,
+												-1.0);
+
+				rows = tmp->rows * tmp->parallel_workers;
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+
+				/* Fall through */
+			}
+
+			/*
+			 * Consider incremental sort, but only when the subpath is already
+			 * partially sorted on a pathkey prefix.
+			 */
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				Path	   *tmp;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(useful_pathkeys) != 1);
+
+				tmp = (Path *) create_incremental_sort_path(root,
+															rel,
+															subpath,
+															useful_pathkeys,
+															presorted_keys,
+															-1);
+
+				path = create_gather_merge_path(root, rel,
+												tmp,
+												rel->reltarget,
+												tmp->pathkeys,
+												NULL,
+												rowsp);
+
+				add_path(rel, &path->path);
+			}
+		}
+	}
+}
+
 /*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
@@ -2899,7 +3112,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 			 * once we know the final targetlist (see grouping_planner).
 			 */
 			if (lev < levels_needed)
-				generate_gather_paths(root, rel, false);
+				generate_useful_gather_paths(root, rel, false);
 
 			/* Find and save the cheapest paths for this rel */
 			set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef12547ee..b99cec00cb 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -774,6 +774,34 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * Find an equivalence class member expression, all of whose Vars come from
+ * the indicated relation.
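+ *
+ * For example, if the query joins on "a.x = b.y", their equivalence class
+ * has members a.x and b.y; called with rel = {a}, this returns a.x.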
+ */
+Expr *
+find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+
+		if (bms_is_subset(em->em_relids, rel->relids) &&
+			!bms_is_empty(em->em_relids))
+		{
+			/*
+			 * If there is more than one equivalence member whose Vars are
+			 * taken entirely from this relation, we'll be content to choose
+			 * any one of those.
+			 */
+			return em->em_expr;
+		}
+	}
+
+	/* We didn't find any suitable equivalence class expression */
+	return NULL;
+}
 
 /*
  * generate_base_implied_equalities
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index aeb83841d7..9608fdaec8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5090,6 +5090,71 @@ create_ordered_paths(PlannerInfo *root,
 
 			add_path(ordered_rel, path);
 		}
+
+		/*
+		 * Consider incremental sort with a gather merge on partial paths.
+		 *
+		 * XXX This probably duplicates the paths we already generate
+		 * in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * sort_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->sort_pathkeys) > 1)
+		{
+			ListCell   *lc;
+
+			foreach(lc, input_rel->partial_pathlist)
+			{
+				Path	   *input_path = (Path *) lfirst(lc);
+				Path	   *sorted_path = input_path;
+				bool		is_sorted;
+				int			presorted_keys;
+				double		total_groups;
+
+				/*
+				 * It doesn't matter whether this is the cheapest partial path;
+				 * we can't simply skip it, because it may be partially sorted,
+				 * in which case we want to consider adding an incremental sort
+				 * (instead of the full sort, which is what happens above).
+				 */
+
+				is_sorted = pathkeys_count_contained_in(root->sort_pathkeys,
+														 input_path->pathkeys,
+														 &presorted_keys);
+
+				/* No point in adding incremental sort on fully sorted paths. */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+				total_groups = input_path->rows *
+					input_path->parallel_workers;
+				sorted_path = (Path *)
+					create_gather_merge_path(root, ordered_rel,
+											 sorted_path,
+											 sorted_path->pathtarget,
+											 root->sort_pathkeys, NULL,
+											 &total_groups);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+		}
 	}
 
 	/*
@@ -6444,10 +6509,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 		foreach(lc, input_rel->pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_path || is_sorted)
 			{
 				/* Sort the cheapest-total path if it isn't already sorted */
@@ -6503,6 +6572,79 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 					Assert(false);
 				}
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			/* Now decide what to stick atop it */
+			if (parse->groupingSets)
+			{
+				consider_groupingsets_paths(root, grouped_rel,
+											path, true, can_hash,
+											gd, agg_costs, dNumGroups);
+			}
+			else if (parse->hasAggs)
+			{
+				/*
+				 * We have aggregation, possibly with plain GROUP BY. Make
+				 * an AggPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_agg_path(root,
+										 grouped_rel,
+										 path,
+										 grouped_rel->reltarget,
+										 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+										 AGGSPLIT_SIMPLE,
+										 parse->groupClause,
+										 havingQual,
+										 agg_costs,
+										 dNumGroups));
+			}
+			else if (parse->groupClause)
+			{
+				/*
+				 * We have GROUP BY without aggregation or grouping sets.
+				 * Make a GroupPath.
+				 */
+				add_path(grouped_rel, (Path *)
+						 create_group_path(root,
+										   grouped_rel,
+										   path,
+										   parse->groupClause,
+										   havingQual,
+										   dNumGroups));
+			}
+			else
+			{
+				/* Other cases should have been handled above */
+				Assert(false);
+			}
 		}
 
 		/*
@@ -6514,12 +6656,19 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			foreach(lc, partially_grouped_rel->pathlist)
 			{
 				Path	   *path = (Path *) lfirst(lc);
+				Path	   *path_original = path;
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
 
 				/*
 				 * Insert a Sort node, if required.  But there's no point in
 				 * sorting anything but the cheapest path.
 				 */
-				if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+				if (!is_sorted)
 				{
 					if (path != partially_grouped_rel->cheapest_total_path)
 						continue;
@@ -6550,6 +6699,55 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 											   parse->groupClause,
 											   havingQual,
 											   dNumGroups));
+
+				/*
+				 * Now we may consider incremental sort on this path, but only
+				 * when the path is not already sorted and when incremental
+				 * sort is enabled.
+				 */
+				if (is_sorted || !enable_incrementalsort)
+					continue;
+
+				/* Restore the input path (we might have added Sort on top). */
+				path = path_original;
+
+				/* no shared prefix, no point in building incremental sort */
+				if (presorted_keys == 0)
+					continue;
+
+				/*
+				 * We should have already excluded pathkeys of length 1 because
+				 * then presorted_keys > 0 would imply is_sorted was true.
+				 */
+				Assert(list_length(root->group_pathkeys) != 1);
+
+				path = (Path *) create_incremental_sort_path(root,
+															 grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(grouped_rel, (Path *)
+							 create_agg_path(root,
+											 grouped_rel,
+											 path,
+											 grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_FINAL_DESERIAL,
+											 parse->groupClause,
+											 havingQual,
+											 agg_final_costs,
+											 dNumGroups));
+				else
+					add_path(grouped_rel, (Path *)
+							 create_group_path(root,
+											   grouped_rel,
+											   path,
+											   parse->groupClause,
+											   havingQual,
+											   dNumGroups));
 			}
 		}
 	}
@@ -6821,6 +7019,64 @@ create_partial_grouping_paths(PlannerInfo *root,
 											   dNumPartialGroups));
 			}
 		}
+
+		/*
+		 * Consider incremental sort on all input paths, if enabled.
+		 *
+		 * We can also skip the entire loop when we only have a single-item
+		 * group_pathkeys because then we can't possibly have a presorted
+		 * prefix of the list without having the list be fully sorted.
+		 */
+		if (enable_incrementalsort && list_length(root->group_pathkeys) > 1)
+		{
+			foreach(lc, input_rel->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(lc);
+				bool		is_sorted;
+				int			presorted_keys;
+
+				is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+														 path->pathkeys,
+														 &presorted_keys);
+
+				/* Ignore already sorted paths */
+				if (is_sorted)
+					continue;
+
+				if (presorted_keys == 0)
+					continue;
+
+				/* Since we have presorted keys, consider incremental sort. */
+				path = (Path *) create_incremental_sort_path(root,
+															 partially_grouped_rel,
+															 path,
+															 root->group_pathkeys,
+															 presorted_keys,
+															 -1.0);
+
+				if (parse->hasAggs)
+					add_path(partially_grouped_rel, (Path *)
+							 create_agg_path(root,
+											 partially_grouped_rel,
+											 path,
+											 partially_grouped_rel->reltarget,
+											 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+											 AGGSPLIT_INITIAL_SERIAL,
+											 parse->groupClause,
+											 NIL,
+											 agg_partial_costs,
+											 dNumPartialGroups));
+				else
+					add_path(partially_grouped_rel, (Path *)
+							 create_group_path(root,
+											   partially_grouped_rel,
+											   path,
+											   parse->groupClause,
+											   NIL,
+											   dNumPartialGroups));
+			}
+		}
+
 	}
 
 	if (can_sort && cheapest_partial_path != NULL)
@@ -6829,10 +7085,14 @@ create_partial_grouping_paths(PlannerInfo *root,
 		foreach(lc, input_rel->partial_pathlist)
 		{
 			Path	   *path = (Path *) lfirst(lc);
+			Path	   *path_original = path;
 			bool		is_sorted;
+			int			presorted_keys;
+
+			is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+													 path->pathkeys,
+													 &presorted_keys);
 
-			is_sorted = pathkeys_contained_in(root->group_pathkeys,
-											  path->pathkeys);
 			if (path == cheapest_partial_path || is_sorted)
 			{
 				/* Sort the cheapest partial path, if it isn't already */
@@ -6864,6 +7124,55 @@ create_partial_grouping_paths(PlannerInfo *root,
 													   NIL,
 													   dNumPartialPartialGroups));
 			}
+
+			/*
+			 * Now we may consider incremental sort on this path, but only
+			 * when the path is not already sorted and when incremental sort
+			 * is enabled.
+			 */
+			if (is_sorted || !enable_incrementalsort)
+				continue;
+
+			/* Restore the input path (we might have added Sort on top). */
+			path = path_original;
+
+			/* no shared prefix, no point in building incremental sort */
+			if (presorted_keys == 0)
+				continue;
+
+			/*
+			 * We should have already excluded pathkeys of length 1 because
+			 * then presorted_keys > 0 would imply is_sorted was true.
+			 */
+			Assert(list_length(root->group_pathkeys) != 1);
+
+			path = (Path *) create_incremental_sort_path(root,
+														 partially_grouped_rel,
+														 path,
+														 root->group_pathkeys,
+														 presorted_keys,
+														 -1.0);
+
+			if (parse->hasAggs)
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_agg_path(root,
+												 partially_grouped_rel,
+												 path,
+												 partially_grouped_rel->reltarget,
+												 parse->groupClause ? AGG_SORTED : AGG_PLAIN,
+												 AGGSPLIT_INITIAL_SERIAL,
+												 parse->groupClause,
+												 NIL,
+												 agg_partial_costs,
+												 dNumPartialPartialGroups));
+			else
+				add_partial_path(partially_grouped_rel, (Path *)
+								 create_group_path(root,
+												   partially_grouped_rel,
+												   path,
+												   parse->groupClause,
+												   NIL,
+												   dNumPartialPartialGroups));
 		}
 	}
 
@@ -6961,10 +7270,11 @@ create_partial_grouping_paths(PlannerInfo *root,
 static void
 gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
+	ListCell   *lc;
 	Path	   *cheapest_partial_path;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
-	generate_gather_paths(root, rel, true);
+	generate_useful_gather_paths(root, rel, true);
 
 	/* Try cheapest partial path + explicit Sort + Gather Merge. */
 	cheapest_partial_path = linitial(rel->partial_pathlist);
@@ -6990,6 +7300,53 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 
 		add_path(rel, path);
 	}
+
+	/*
+	 * Consider incremental sort on all partial paths, if enabled.
+	 *
+	 * We can also skip the entire loop when we only have a single-item
+	 * group_pathkeys because then we can't possibly have a presorted
+	 * prefix of the list without having the list be fully sorted.
+	 */
+	if (!enable_incrementalsort || list_length(root->group_pathkeys) == 1)
+		return;
+
+	/* also consider incremental sort on partial paths, if enabled */
+	foreach(lc, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc);
+		bool		is_sorted;
+		int			presorted_keys;
+		double		total_groups;
+
+		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+												 path->pathkeys,
+												 &presorted_keys);
+
+		if (is_sorted)
+			continue;
+
+		if (presorted_keys == 0)
+			continue;
+
+		path = (Path *) create_incremental_sort_path(root,
+													 rel,
+													 path,
+													 root->group_pathkeys,
+													 presorted_keys,
+													 -1.0);
+
+		/* estimate the total row count, as the other call sites do */
+		total_groups = path->rows * path->parallel_workers;
+
+		path = (Path *)
+			create_gather_merge_path(root,
+									 rel,
+									 path,
+									 rel->reltarget,
+									 root->group_pathkeys,
+									 NULL,
+									 &total_groups);
+
+		add_path(rel, path);
+	}
 }
 
 /*
@@ -7091,7 +7448,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * paths by doing it after the final scan/join target has been
 		 * applied.
 		 */
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 		/* Can't use parallel query above this level. */
 		rel->partial_pathlist = NIL;
@@ -7245,7 +7602,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	 * one of the generated paths may turn out to be the cheapest one.
 	 */
 	if (rel->consider_parallel && !IS_OTHER_REL(rel))
-		generate_gather_paths(root, rel, false);
+		generate_useful_gather_paths(root, rel, false);
 
 	/*
 	 * Reassess which paths are the cheapest, now that we've potentially added
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index ed50092bc7..c7bd30a8bf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
 								  bool override_rows);
+extern void generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+										 bool override_rows);
 extern int	compute_parallel_worker(RelOptInfo *rel, double heap_pages,
 									double index_pages, int max_workers);
 extern void create_partial_bitmap_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -135,6 +137,7 @@ extern EquivalenceClass *get_eclass_for_sort_expr(PlannerInfo *root,
 												  Index sortref,
 												  Relids rel,
 												  bool create_it);
+extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void generate_base_implied_equalities(PlannerInfo *root);
 extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
-- 
2.21.1

#287Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#286)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, Apr 05, 2020 at 03:01:10PM +0200, Tomas Vondra wrote:

On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote:

On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:

On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...
5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.

This is another area of the patch I haven't really modified.

See attached for a cleanup of this; it removes cost_full_sort, so
label_sort_with_costsize is back to how it was.

I've directly merged this into the patch series; if you'd like to see
the diff I can send that along.

Thanks. Attached is v54 of the patch, with some minor changes. The two
main changes are in add_partial_path_precheck(): first, it now also
considers startup_cost, as discussed before. The second change (in 0003)
is a bit of an experiment to make add_partial_path_precheck() cheaper by
calling compare_pathkeys only after checking the costs (the scalar cost
comparisons should be cheaper than the function call). add_path_precheck
already does it in that order anyway.
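
To make the ordering concrete, here is a rough sketch of the idea, as a
hypothetical standalone helper (simplified - not the actual
add_partial_path_precheck code):

    #include "postgres.h"
    #include "optimizer/paths.h"

    /*
     * Sketch only: is a new partial path with the given costs/pathkeys
     * still worth building, compared to one existing partial path?
     */
    static bool
    partial_path_still_interesting(Path *old_path,
                                   Cost new_startup_cost,
                                   Cost new_total_cost,
                                   List *new_pathkeys)
    {
        /* cheap scalar cost comparisons first ... */
        if (new_total_cost >= old_path->total_cost &&
            new_startup_cost >= old_path->startup_cost)
        {
            /* ... the list-walking compare_pathkeys() only when needed */
            PathKeysComparison keyscmp = compare_pathkeys(new_pathkeys,
                                                          old_path->pathkeys);

            if (keyscmp == PATHKEYS_EQUAL || keyscmp == PATHKEYS_BETTER2)
                return false;   /* dominated by the existing path */
        }

        return true;
    }

The scalar comparisons are essentially free, while compare_pathkeys() has
to walk two lists, so paths that can be decided on cost alone never pay
for the list walk.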

Oh, I forgot to mention a change in add_partial_path - I've removed the
reference/dependency on enable_incrementalsort. It seemed rather ugly,
and the results without it seem fine (I'm benchmarking only the case
with incremental sort enabled anyway). I also plan to look at the other
optimization we bolted on last week, i.e. checking the length of the
pathkeys, and see whether it actually makes a measurable difference.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#288Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#286)
4 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

I've pushed the first part of this patch series - I've reorganized it a
bit by moving the add_partial_path changes to the end. That way I've
been able to add a regression test demonstrating the impact of the
change on plans involving incremental sort nodes (which wouldn't have
been possible had the add_partial_path change been committed first).
I'll wait a bit before pushing the two additional parts, so that if
something fails we know which bit caused it.

I've been running extensive benchmarks aimed at detecting any
regressions caused by this patch, particularly during planning. Attached
are the script I've used and a spreadsheet with the results. The numbers
show throughput for different queries (SELECT, EXPLAIN and joins) with
the patches committed one by one. There's quite a bit of noise, even
though the script pins processes to cores and restricts the CPU
frequency. Overall, I don't see any obvious regression - the numbers are
generally within 0.5% of master, and the behavior is the same even with
-M prepared, which should not be subject to any planner overhead. I do
have results from another machine (2-socket Xeon), but those are much
noisier; the general conclusions are about the same.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

run-benchmark.shapplication/x-shDownload
results-join-v54-i5.csvtext/csv; charset=us-asciiDownload
results-join-v54-i5.odsapplication/vnd.oasis.opendocument.spreadsheetDownload
4�I�68x��%�9��kVA��I[Q67=�L�}�`T�R�6��F��`49�p�*���
�1���KJs��`�5�d�F��H�d�K4s�	�,g���l����Y�cp�"p�c*�L�u�.=��
'��
���I[o<�����2Xg
6Y�M�<��6�nz���M8g:g;��~�^z��=?-��E<}�`7.= �)����v~��]^y��i�<��.��������S>�����^�d�$�Y�Ma��hu��z�7�I�dp�tN�B��8�������l�;��)<���sp�so>s��`�5��������e.=���&��
��,���{�7=�L�%���`T�����c'�mz0��Q8iJ���A:cK+������>k�����CuD7�>�@3'�p�2p6a�!6�p^z��5P�,�0#��$����gV8��U���L����)��x&M��:k��:;����}���`3)�p�2t��M��-t��f���gv�y�����4;�p�2�����]99���4e��i��9��~����d7u������d�$�Y�MBkMV�����y28g
:�L�5�<Vl����l�;��y8�7�n3F���9�d0�hN�v��rnz��M�f8����vL"�I��p&O��@k0�}6jMl���`49�p�2����6w���w/=(�(���@��ELQq�p��fN4��e�l2����
�g�@���c5���`��gV8��U������3�M<�&�`�5�d�L�q�z��fR6��e�<���s�`�����Q2XhH_��\~p�{~�����K�`��h6
��/=0M�*��~����D}}�p�%��Sw7=�LI&\�
��Q����\�7�I3ep�tN�Ic�h��w5�L�%|�
��];�>5��vo>sF�`�5���'�0
������`�7��i����������sd���j7=�Z��y��hrF��U(�M�dZp�y��h�8���/�t�hmE7"<�@3'�p�2p6Q&kkE��n
�g�0������3z����Y��{V���p�.<b����3i��Y���s:�h<�a����l�9��9�Q�5z��C:���`�E =���|��=��4;���2���M�1���.=0M�'��~��ce~�.�����X�Vo���M2S�	����Hk���,�l�
@g�D���S��2���d9�LJ&��
��	��dq�3���3g��Y�irL!������@3'�p�*pK�es���p&O��@k0��)��i��=z0��Q8iJ}85��b5M6=(�*���@�%���-\x���9��o�����N?�0X�vo<�F���E�����=	�pf��YO������X�M<�f�`�5�d�B��8X������l�9��y����e?
]z��=J-��I>����^z@�RxiLU'�9T��.}6L�uBj�T�0�_+	fpQ��������>�7�)��!�	8?�����<Fw������L�*�s��sys&�c����fR4��E���S���`leo<s��`�%��&N>���mz���L�f6�5c3������L�%������)k���@49�p�"��sS�F7&lz@�3R�,A�Xc��Ucdnz���L�f6��$����M6�����%��Ec����y��fV6��E�4[F��b�y��f�$l���6����k��fR0��U�4a�!}����g��������EO<z�AhvB��U��gZs� �������<?��w��.���D�
�
�fJ4��E��,��c����L!�s��s-�4�5xX���I��s����c�*L�l
�g�����LiY�=��d�$�Y�Mn���5�r�{��<:-�M:&��c�)��&GNZ��c�lSVl�u���2�g	2e���[�}�� 3'�p�"l�C�>4\������	2�f	6E��5�Q���`3+�p�"t����[[�onz��4F�,���Y]o��
x���fR4��U�4Y��l�q�g�����-S��������V�T��������������>?���z��(����|D�����h�5����@N�>�='x~i8����%�\�
=�0��v���h�9�������<$�U���I2g	6��h���hnz���L�f6�9O��$6�����2h	H����"	��&GNZ��}�8������i�$���b����#�#�� 3'�p�"l�Ifj�{pl�7�Y�dp�t���>x��nz���Mxg:����5������3i��YMQ�������h�7��i���+f^r��=?��A���6V�g<���hvF��U(=��l�2�Fz�Ai� ���.#�m���w���V�.�\
�fJ4��E�>�����
`[��4H�,�r�k��G��@3)�p�"p�2Z}��s�
�g�$����{#������2'���"\�E�-�c��_�h:���`�% 5��<~A�%m7=M�(���Ku��k���= ��"�}� S�
2�`]�K.sr	�,B��+����<s�����1�f	6E���>��U2��	�,���l�:��&;z��4>�,���u��kFK"<�@3)���*p��u�~
z�g�����)�2���^z0��Q8iJ��L�d�����>?����oW"�������� |`��h�D�YNa5����`����L �s��s�-���nlz��M8g8y��4U9�Uho<s��`�%�\>;��[�7_��2'�p�"d��q*��R��3yt�YR���{-v��������t����{��v�MHs��`�%���F�uWTrp��Kxf2��JAf�����f�\���������;_�`3+�p�"t�9�ug�=��t&
��8K��r�������/z��M�f8M�|6m�T�����c0�����eM�����`4;�p�*��j��sE�=��4e����?~�[I2�{�n�6��P7���� 3%�?�4�����&����_[��4D��A�l��9�m���`3)���*t�1BUM��f�{��3I������>u����@3'�p�*pZ��z*�j��p&��@k0�ctru���s������I�P�]����nzP�3M��&k�d�V�S��fN4��e�l�r�u_[����y2xg<�M��{�����Y��{V���Dv��n�:	�x&���:k��:�����n���fR6��e����\���mz��=K-��ImtY��	��f�^ZSU&_s���CLS��`��v�`�����&]sJ�����)��kVaS��N�3��oo:�����5���\�l�S+9�LJ&��
������c��{��3T����}�5��{>�@3'�p�*pZ�x
��nlz��<K���vo4�\�E�MF�3
'�B�l,tF�{l����9Ce0�h��Ek�����K4s�	�,ggre
�y�g�@���gzS��b_z��Nxg<]t�1L�#�ah����2Xg
6��7����
�l&e�Y������:Gp�y�Ag� ,����q������f�^ZS�N��V�r���)3e0�O�����?H{2����s�M2S�	����H[B���?�����L�)�s��s
�%����7lz���Mxg:��a�b����7�9Ce��p�9�x�I���9��sV����[7
�����3y�Z�Q�����������hrF��U(�g%�v�c�4�MJs��`�5�dm�����=_�@3'���2p6�B]�9�]hk<�����E���d��
z���Y��{V���2����s����2Xg
6_�6�X�2C=�L�&���������_z��=L-�p�6��Qoz@�RxiLU��6�#�������@?���4���_��w��G��L�� 3%�p�*l
�T�>�bo��3i��Y������j�����I��wV�S��E���	�W��3V������?+�-C�`��YMk����`{���Y2�g
Fu��\2,����`49�p�2��bi��Sx/=(�*���@�����s�����z��M�f8[W�i�Gp���xf���;��y���k����M8��	�������Y4��v����2Xg
6Y'�h2f����I��s����o���t��Ag�(,�������}>��4;���2����i#�����L��<?�������_�%����
���� 3%�p�*l�9H�1�=xB��t&���9k������a�
C�l&e�Y�NaBKE�Q��t���8k�9�6bQ	���@3'���*p�9�\]Z���7=�L�#���`T�(��,�������V�t�H�k��KFs��`�5�|In������.=���&\��M�����]���a2xg<�����p��U8��	������>9�W���3i��Y�M��k.Q���`3)�p�2t�6���v^z��=D-��A�mi�X����!������n�>�u��}6Lc��'��O������`����6���%���h�D���&�����	
fY����p&M��9K���@mv��<���fR4��E�d?�����$�]hk<sF�`�%������b3�M2s�	�,�&���������l&���>K jK&�������hrD��E ]�
��8t�i�@���b���uYc���9��ka��MZ���%���3k��Y�N�����=�����fV6��E�<w����A:7=�L�%�q�@St
b[+��K0��	����i#;�H�_�z��=>�����ay�y������V�t�Psm-V%~����12����������h�Z�E���L�&\���6�1X]�n
g�����]i��cs���h�9����&����t�7�9sd0�l��Il��$��AfN2��U�t_�����l&���>K j�;��f���������t��a�#����i����bl4�r�@���9��ka�[�J,�F���3k��Y�N�!�ec�`R���Y��w���h4Vo���t&
��8K�)����I0Fv��fR4��U����[��)�����c0��d�������fgNZ���F:��Um7=(M$�}~���uz�f<�������L�&\�����������3i��Y�����7mz��M8g8����%�F�{��3I�,���n�3�-h����d�7����eP_sj�����L!������"M2G��"	��&GNZ�e��{����KDs&�`�%��3M��h4��*�9��c!�[��5����
@g�|���}R�kY�,��lfe�Y�N[��Y�<�a����12g	4_�\S9zP���I��oV������E�c�pfO��@k0:z#�65��MF�3
'�B��E���#���)#d����|�{��_a�����I����5�)��c�S���X�Z�����28g	8��S[�bm/9�L
&|���M�1���G����� �m�`�Y�4���onz���L�f6�5k���{���3ypZR[���=�|�hr@��E]6�VWG;z@�3C�,A�5_:��K2s�	�,��1�lN�'�����f��|���c7#���`3+�p�"t��LSL9�poz��4@�,���Y�f�c�z��M�f8m,�<��uc����1h
F�8���x>� 4;���*�Ni$]$��r�Ai����.�H��������
>O���;
�fJ4��E���I\[�|���L"�s��s���DP~��@3)�p�"p��-W]3�����)2g	6��$��9��M2s�	�,�&7�c�:�	d�C��t&���@K@j�\iSF�����&GNZ�e���i0���i����b����h��K2s�	�,�&���dyp�s����)2�f	6E���c�[R��`3+�p�"t�jJ�,�6��Ag�����Ki�5�537=�L�&|�
�gb����������1h
FG?f����K.9�N(|�
����O�p�����1�`����o%��������An2�Gmz����b�`��d_����C�K6�����5�����5c����fR6��U��������%{��3G���<�����i���fN4��U�4i�������M8���`�5U9�x�=fw������I�P:�	�����������>k���L}���Em/=���&\����L��1f����xfM��;��9l��[����3+�p�*x��I��5�����3i��Y�M�)tV=�d_r���L�f6'�������tfO��@�@:|��a�������K�`�s��=����)e0�O�����?�z��#Z�����d�5��)��1Z��4������28g
:_��&j+6���`3)���*t�y�<�Rk������)�u��s�y���<ca�M4s�	����+
[�bpnz��<I���o4����S6=M�(��
�����5���7=(�*���@����R�����h�D�Y���hN��m��pf
��9��9�������mz��Nxg<�sH�.3����I3e��l���5��K6��	�,C�����Sb;�7=��%���t��Jk#�i������K�`��iM������i�T��.����&�_��8u=O��=_� 3%�p�*l
�3����<�`3i��Y��)r�6��V���fR6��U�����.�
�{��3R����������_z��M8g8MDHm�h��K8�'�`�5�>�l����`49�p�2��-j��c��MJsF�`�5�dmNSZ_��}�h�D�Y�����h�b&{��5R�,��yZ']#��|���p�=�����F�EO�|��g�L���/�u�\�����l&e�Y����T�W0Jv�Ag�(,��g~s��,$���4;���2���d�$�\���4e��i���j�������0���Z��.=�LI&\�
��.L<d��{��4U��A�lg�������K6��	��B���6d��������2Xg
8���������h�D�YN���2�����g�4��:����b�)��&'>Z���l�����.9�)�y���-j>�A�M4s�	�,g�����=z��5N�,�0%��-������p�;����ic���x&���:k��rV�;K�>���I��s��s����{���M:���`�E =��S������f�^Z���:�Oz���)e0�O�|,��vx�����f�s�r�� 3%�p�*l
;wZ���h���t&���9k����:�����fR6��U�k���%���7�9e��pN�A���nlz��M8g8M��c&�4�loz��<I���YR��5
]���hrF��e(]�4�Y�=������@k������K�nz��M�f8[�N���n�����2xg<_N�\�85z��C8��	�����n4����oC�x&M��:k��r������������l�9��9������{��Ag�0,������b��MH�C
/���1r��S�5�l����y���Ow��"���o%�|��G�+sP�����{���b����/.�����Z�6��3i��Y����2����@3)�p�"p����|y�������*�q�`�[7�1�c5�6=��I&|�
��:Y�D�!��sd0���L�,3V:s�����E�@��72�c6��i�8���2���m�� 3'�p�"l�r%�C9���`3k��Y�M9J��.����lfe�Y�N��mcsd�t&���8K�)���
�C��@3)���*p��4��}����3{xZ���B�@�����������I�P�:Iz�^\r0�2B��42�����t��5}��o
�fJ4��E�<�����v���L!�s��s-n�G�����fR4��E�|95����`9�����!�q�`s�:��M����nz���L�f6���<X������c��x��N�d�X8e�����C�@��P��LP�E7= �� �}� S�t2m����z���L�f6�Y�����]z��5?�,��7��<�~����l�9��ijN��{l=e����2g	4E��������M4��	����F�[�k�C8���`�5�'s]M���W=��(��
�z���k�`��KJSF�`��u��o��V���e	
*12�@3%�p�"p
��5c�t���L!�s��s�i���!�K4��	�,'�y�������7�93d0�lz�Fs�>c��M2s�	�����"�k�X76=�L�}�@������;�h�����G�@��j��,��r��h����2���a3�����9��ga�]����z��Af�\��g�h���hU�K6��	�,B�M^4[om?=��3i��Y��������t�=�L�&|�
�s>��pnz��=9��hw&v
�</=��(��
��Ln*=:�}�Ai����.�����w�v��DL{l��7�)��k�SX�h�HN@��3i��Y�e�����T6=�L�&�����
�����so<sf�`�%�����K4������d�7�����mI�V�����c�����F��#��oz �Q�hH��NC�6%\r �3C�,���i4�l�v�oz���Lxf6�����6�s��lf���7K�)��t�{����`3+�p�"t��L�|@�$�����!2g	4E������W9�L
&\�
��\�l�d��g������;Y]b��6=��(��
�����p��KJSF�`��u�������w����d�I4�y5�)��k�S��	�.���{��4B�,�Zl��j��M4��	�,'��L�$���^�
�g����/i����)��AfN2��U�t/+��Z��l&���>K jz���i��w��hrD��E ]���{����MHs��`�%��iF�����=��I&\�����������`3k��Y�M������w^z���M8g:m��Z[3\��U:���`�%�BSg|��C4��	���t���������1h
F���#������AhvB��UU]�g�:���)Cd?�<����J���'��_�����cdnz����b�`��d�N�C=h�[��4D��A���s���6=�L�&��
������-X�ok>s��`�5���u��[���fN4��U���L,��'xnz��<@����
ZKf��<7=M�(����F�����W=(��%���@��/����]z��M�f8�v���F�N_
�g�4���cyk,��gV8��U��1&uo��S�/=�L(�u�`�u,rmk���l&e�Y���I�h}�M:�g�`�E �h*KpQe�����K�`z�)1�����������@?���J?�\��*�=���!�)��gV!Sx���=�������2�f
6'w!�W����`3)�p�*t��i���E��m
�g�H���s�E�]Wl7���9��sV���
�>L���/=�L�$���`T�jCF0����hrF��e(��t�Y%XIs�����2h
4y�����t���h�7���dtb�9�7���Y�dp�"p����d�8_�@3+�p�*p����]9F���I�d0�l�������G�����l�9��9��>$X������!2XhH�35����%������T���)+��������>?��1�z��>|q��I���� 6=�LI&\�
����\�q�t���L�'�s��s���@�������I��wV�S�i����f=
�g�@���s�Nk�On��@3'�p�*p�d7O	���g������������������tu&��z�!���4g�Z���:�����C�h�D�Y�&��]������2xg<�tZ.������Y��{V��y)��w�oz��4S���&��1N�b����l&e�Y�Nw�M�Un7=��&���t���@�E�
=��4;���2���<V��CHS��`��v��o��V����4�����M2S�	����p�E�tK��
@g�L����M!�}��nz���Mxg:��Oj2$Z+ak>s��`�5��s,Rn=��o���h�9��i�/���F��=�L�%���`T�*��f1��`49�p�2�.�$����V6=(�*���@�����M@147=���&|��M�h�����K8����E������o����p�;���r�!G��c_o7=�L�(�u�`�UO��=_�`3)�p�:t�%�DW�L��tf��B�@z�7�<�3���UH�C
/-��y^�,��{~����2��]>�e����K���<c��� 3%�p�*l
�)�����=��3i��Y�N��h�����M6��	��B��5o�]��y����*�u��s��HEG����3'���*hZ�g��h��U8�'�`�5�������s/=M�(�����64�Z���4g�ZM���5:���3'�p�2h6i���&��B[��5P�,��yV�a�K�t^z��N�g<]�}�cj�l�K<�&�`�5�d�A>�WlC���I��s��s6�5�[�������d��"�g��f�N����!�����Lo�9G��K�NL�������~A���������f���O��/����~E����^�	���������_:�O�A��������y�����?o_����_���_��r@�7�x<g��[�@���0�����A������������{_5������_~�/��_��o�|�|�?��;/�����o���������������������U2O�'��z���s���_�9��Z?�w��o�(EG�����������t���?���j��|1�jg�7����������7���s;u�a����[jr|o�n:~�S��o��������g����#�w����{��O���z��s~�-C/>�����}�����
�!}�[m
>��Y���ga���D��h����7�.�=�T���]��t
FX7��;j������yl�O���o�����c��<�f��*��i�'������i4�qOc��MS�������������q<l��1c�o���`}PS�`u�M��I�����y����Fk
�13��?������>��sg~���h����6��Y������C�B&��R7�]:���G�@�MW�z'����7�K���o�mF�d�����r��DO�g0v��o���p��MS���c��S������C�{�p�^�wuH��X�������"^<4�]��o�{'9����Fp�����C��i���;n
n��:����}�m
������h����m]N����[���t�~�7Dp��7��)w#'��i�K����C��1�������]]jk���s.ol
�����t	~���wu�'��9�g��
n���u~����u�+�P>�r��W�����0ZX�jpS�k��
=X��!������5�
P�w��6��{��������H�!�����G���4�������I�'������~�+�A�=Kv��}&�f8r~5�����ox��}���x~��<���\���{t�~k���s,:.�,��x4xvW�s���������w��w=2Iu�s�����=������?d�1?:�I�ek�^���C9.|X�1����o�'w��E��
��l�gw���9}r�&6��;z�H�������[����cP>g�Y��O���o@�c�A�M�"����T�eM���/�=�����h���M�Z�����������j����;t���C���B<�t���7�mKY�I�������D���C�6�]�A�?8����v��:�58���7u��������K�C�$������W�mo����Y�)��wuh���F;����!md�xaE����:����}7�o���=7L�~�������y�h]�����$�Ns�������� W�����x[������]]�3^2��X�������
	���wu�/�f���7��7���������gx��5@��x�9�ht5��K�������G���%9L��kt�����?�r3
F^�wu�3���bl
�����t���Y��A����[��Q�q���]m��c��=��oop����1,6�
�7�m����@�!���]�hc�.������m�Z���`����Io�������]�������q���+�^
������f���7x�f���yZ�@����w������~��n
���yf�}�h�fk���Hc���[_����A���{�0�S#6�+�a
���W?^��p*Z��w�s������V<�&hN[��w���wL��o�����:5o�������&�1���J��������Jz@2=����\�I���j�#Z������L�M���MS��K:�g.6o��wu������7Z�����C<����8x����9�{t������'7}�=F�����o���3$:���_�w�������q^��:��������KW��#��T���z~##��Xz�����8����m��n��:d�,�#���^��:$�s�U�+�����e���z���!>���[�7�m����������W~��\�i�1i�Z��`���%os�[�B�{����:��I������.�	-�+Xrop�]j���F�M�W���%�N�>���?�[���}�'����KS��:��,���o���Ik�����C�$"f�=�wu���v��������8��1���_%��z:�/*�+�
n��r��`������.M:�.����m]��d��{���p���������]]rw�)+\=�j����"L�a\��$G��\�'oy9C�g�������s�z<��AM��[��?����h��M���.������M���N���[�������Qx��
/5=���-�����yrL�^v��`�~�?��m8�������Q�F�t������1/]4�����M��-G��t��������$��?��#=���hy�MO��Y�m�f�2G�M�i��u��"o���`�&�?:>��7uh�@�Wq�/����C}��h���d��L�����z����-�����N]x�n��:��Z-�o����!;�R�������Es�-,|��������Zv���	^)w�Ifk$�c&���KW���f�(�M�������w����CN�U��.�]�z��-:���w�!Yg�yD>\��o���.���g�!�j�{��Vb�!����%Z^�jpW�����1��.]
n���+qq~�����c�sp~���,�1)��]���o��(./���h��MS����;{�����C�svsN7�]w����n-f����g�<�x.����w�����JGl��+�J��t4��E[tx�5�m���S`����ny4��Kr�����8�
��R{)"|��
n���1�P������,�{[���W�s�(�q	K��y�����S6Z>5�����^��L�Gt�����]=l����������]F�f���[��oF;&��\��l
��#I���O��s5��/���/�����_��a�����v=��E����=��y�?w��y��X-������:w��]�����s���~N����h�-���U��}W~�q+��{�7}���s�]�N��g������e�,8���7u�����c{�o���1d#����;4����^����C����`��M�3�Q�*�r�c[]-Z�����!���
����P;����*��;������KS��Y.?^�r���/��z4�{���M��.�����d������������o���c/�MW�tQW����MW����F7Kn��:���?�%��Wy�w��ul���t<��*����e��O*b��1�Mwh����������Ov���M��l��:t��%�����y�����=�?�����H��(��%�����C�4���q����;4��Y{0������Y��g����� �|�_���3:�Rc�x�+�F��a�m��/j�5���m]Z�F[+�,���%9~�t]������%i�}D�
n�Rw2�3�xopW�^��2���I�)(���<C�]�����I����]��|m�'wt#�c~0z������Bc���-/��;:��s���K��
T��9��}���s��t�E�t��mE����jl������{K�a�1����+X���?���FM�����^�g��3
�f��_�gw�/��K`��6�s�M������ q���t���t����sp7�M,�LZ������C���)k����o��KW�}���M��%��y����-�u��`Fr��L������e���=�����!����+�uy���!?�yp[����C|��W������C|���������Z�w���x��W�m7}��P-z�����Cs�1<��a=�����YU%�����v�:5��=�wuH�y������7��f����C_�Mr�$g�������`^��:��j�������(k�W�6�m:�kx�d�MS����i��jW���;4��g�~S��?����A�2'��'�\��:��68z�����qp��%��;���U������C���%���\��:�����G�\��/�;7�0>��=�����!>cZ������C�i�)������C����Z+x��wuh����&�M�:�x-X��?�-��L���t�=MdF'����;K���?3�fl�gwT�����$���w��i=X�r�?�����D�,����Vt�)'�-Xx�'{7��<R�-�����.������;��?w;���kDa6��;��7����K����s����'����G������qG_�������������+���T$��5|V	��m��:�X��������S���6��16�]r!�}cV���'h6�gw�m����
:�7�I_�?�����H�x��Y\(Xm�����q��K~Ww���2�<�����!�4����_����Y0kE+�l�����"�y�����	^(w�G�s��\�7�����uVRn���wu���U�Y������CLM�wg�O�����������]�������>8�{�+�Gn.�����y�kp�v���!w����_96�]jN�$�Xn��:t>�<�����3��T�
V���7��;��?�;��7�1Q%=f�������!t�`����wuH���hY�M[�������|Q������+������k���z7}������zc:����
�wuit;�s��YRW����Lr�<buop�]:f�s�>�jpW��|4���������������
��SPi��yn�]��_;.�s7����[9�q��?����#�������4�u��/��;:�,A�8��t�?w�����|TFN6������H�U�,%��5X�p�?wg��c�$s���.��;�R���F�^���w���������Q��,c�������hc����U�6}����xS�vtL�F�\�MW�f#��6Zn�US��
Ls��C��w��s|���w����)��7�MR^���`&���L��������Y&�����!'������o��<^�]V������C��t��V~����Y���D�]��:�m��q�L�C��'x�����^&�3zH����C��4�\<���uHH��h!�Kc��g�+����C������#��wu��q��$�&y�+�In.���:K�F���������^���7u���O[p~���Cz���D��l���!9��w)n���Pgj��l��gx��6t�N�e�@l��F��xM�`q�Kc������M[��,�f�S�7�]�FM��7{o�eG�c	��:�3��	��9���n��^����B�����B�#)�*��������t�H��<#�2#*�]����������i'���"���K��@RDt�X�Yx67����+��������DG�$�Y���[�Pa�5��[g+{#@ �&������j�X�).��U��]E_TQ���Mz�"���7Q=����;�c*���8}��]zE�w�����P���ni�����;o
+�U~�3�^��~A��������^��^e��~p�����_�aE�������o�v��3����OV�m�{n>�������o��u����8�'Ke���;��'������!u<�2������;�������f���ww���I/C��}@�|w�+2��������*��Lgd�[�lm;Fg�eAu�f�,�����@u��,��+��/_�>����������A�����`o?�-�"G�.����:Q���ErI�{��>���uy4���K�������`K��������v�4�.rN�y&R;\��"t������H��g�]����i�������_?���O������������p�����,�_�>������<���{��Z���8��������/52��[�1-��W��>
���92���)�Z�?��d�i��	��mL�o��<^UVV����"�W=$.�����>����2�F1*���C�� QJ.�B�y�����
e$�NH������������Z;X�RW�������hUb������;���~i�G�F��KK���4,�G�ot�={���[�|�t����������M����gG�%�XB&����W)�V���)KF�v<A�����"��/��WF�ri#hW{�#���V�<:U��-j#HA6.�������..G�e�s�*+HB���D�=kk����-��)�V;���~]LG����f��im��,�3�e�(��K�:�!m\`����Q���r�HQ�-Z"�
���������������
��`����B��l���c$��0�_��n�J:���:�����(a�NRR9AJ9;��w�gT�f���K��Yv\;XABp�=�;/�w{+@��{��*��?�/�xh��'\J���m�%��xJ"����OK����u�P
���e��=��������I�>��<��������3�����moB	)J�@:��yX���Ld.���##P9��S�t�T�aJ�<Yq;G��0��]����Gv�H�������
F@��#�T�
�B�K	K��5'�R,���7{p<^���}��(x�Tc�Q���a �A��5G17�!�e�9���&[Cf���WB����J�����B2��m�yX�J�%�
r�FW{X����,Tz�k+P*�|�����U�Q>����Gf��k���R���=VP�_����ua�O�ilW�����f�.>��������5���=q��O1�B���DK��o&��g��
��=[\�/��	��g���z� ~�N	,�r pN�w6Qz�a
BH�e7-���j��R-MEYG�HUV#�Q�YN������j�2�KD���d�=�@$G�Bw���c���c>�+���8�XZ�V�miMV��xy����H^�vD�&��(��Km������ue�&e`b�C)>��dkm�:�v����;����/!�#�3�|7'���_������sW�1v����U��#u�0?��0��Q(��z��k�K�-���z�Uj�K��]I�R��i�1v�����d���
�;
v���/��K���Ek�^��^*��8�e�M�Dn�`>>-�����Ue<',��j��V�$u�L��>D��V�i�������Brra�Z����5�M^�-1/�3���K��t.�v����������PS�d����v�4T*N��������v��F;��+�P�����1;�xV�+S������HM@b��c��9�*
X�J:_>#�L�PZ�����5�DS�e
�"K�@���*p\C���� �S���r0��t
s��
�G��)d�@����M`�*�Bz�'W9A����{����y=�Tu�N���j�0:�����UVs	���{�V;X����^Z	�O�l�����E��#�Ad���(dq
+�����J���-cM����G!���^
�G�5Fs���_I�O��{�%�;?Kz'oa(�(���5�+cf%N�xJ�����$�\F��Y[;XA
�%N2Q{�v�Q������������m7���af�7��zh+��p�������1I�,�R
QH�`���*����� h�LF�Z��qX�����'U��Bbl}����7S�sF������i��y/��416�^#��e����*�#�6��� %���y����`	�,R�t'DTV�"9�t�=%��v�"A?���m�7{m�QVq�^b�;,�'u�pH�1�\��,��V�-:���������I�;")T\\r�q=?S`�Z���������7s�}	������!�k����,�9M��f��z�WL��$��+���GV����������( ���Cg����
��J���4������C���^P������yYfzAUK��I�/�T�*�#6��Z]����e�iu�@�qH3�n�I�|����|<0X�����L���g=�,��%9)�s��
�|�9�<x�/��WB
����8Yq�[����Ez�(75����<�@A����
�+�����D�_��a6RZ��$kR�H�V�t�AN��.����<�����_8"	����Z$�j�E�3�#FY�9�4X��y\��^O������q����fn�`%BI�y
���k*���X������PnJp�u�c�
n�)�� >�0�������f!:F�s,����#��
�8�L��C=������Wz�T�a6��w�
��y,a'M�pHM��OI��Z�{�;�%���q��9Q*�Z�#"��9��\��F?/s3������t������2����k��n���g��3�6uyI�s3�x�D	[�!W���L����j*T�`�*4��o�b�Z��qa�	e�T�.������*���������4X�*�,;@�2w�1vL�B���>��d��X�uNb���Za�������%��WS�#g�����
������9���?*|���f
>`��U��G��^����t�&~�B0�bF�!���]��)�_�|�*����y�������LF)\\�uf)�6�������!h��=���<X��bY};)f�qq������=;\�/)�[����=�	������K����=�����d8$�)����:�������q�[���)=f��y��������Av�HC�
����)�!Y&���Q��E�mj��`�6���.
�l����f���0 ��7�Y8+{#@YS����Q�`	Sv1�N���
k���{��j+H!9YCZ5�O �F��P���:k�k�%�����e�=u�2E
c����u��8�{i���w�q���;^��23����,s�Hv�J�������R`�� �>*ZS��k&�#
}�K�|_+�*�����#��.U��$
�P������F(G�����f���~���7�r0���J�w����f�.���~��
J'uA8�.(��3��ql]\S�qu�����,)c9����������_&f�U��7s�C����1���T����ss�h��#�H��2�
-�}�����D���'��`��T��R�`	�c��[@���"%�n�T9XA
�r+W� �f3)�$6vC������ �
����G�Q��{cA�_����������a�����`uOzq��z���	���9jj1hR� �Y���v��,��s�8��U�>^w[V����U<A�T:�
y�R�x%��)�������Mc�^!���
#�#}�\��`6N�i�������j>A��Q���Ty,!']A8�+H�)���G���H��	����I��O1�!Q���B�z
>��4�������"^00C��5�y$����{y�|,:�Ul�Y������+�F��&�^���R������6��� 1'(�:�[�	�S��s����ml�~����V�����v�N��a	8)
�!E�H�s�.�(��������5�_6����c������^&0�� d@#������`f���p#&�@i�i�2���9��
n�42�p���3[U���R�^�VV�7�hj�?�/�=�@Q�VL������L.�W��������Qv�m�;+y�����p���#�%���'G�4)�U�@���(���_M���"��B�
%���)�!�Xh�w���h��X@��	|�n��)��l��g��wO0��7�`H���+b�����^�QuVV��P���|2�.���O��=.���\����j��C�/[�;eFj�KC�J���q'���<�!%	����2�^��K�I�E�t1d����c,Ur��q���A���t�i����zj��`R�\�$��l�e}���c��,�C�C�5eQ���$��>�Z��c����w�H����KB�v�e~u�A�)�'r����_�/�:$	��d�!k�,��.�'te���.UH�L�;ck��B%�B^:�\j�P�U����������;�d#��f'd-*	�����5M������"P(zo4Zz�HL}r�9kNf&
�~��_t>4_[�	I��w����|h�v���S�]J y��~mtA
�9��������
$�Z�#c�lS�`)��vV�V�F�M��;O�`��uVal5���no5F$�"E��;�hwXB
N
�0���c��������������S�X��gj��7kO������>��C�a��d��XT{:S��l�q���������b�&���q|�8�����T�����n��� ����dfw"�� ioY"Z"|O������s�����H��v�y0���Q{gy��2�����p���S%���7�
S�4\��_��LT��U4fZ�x=7���!��=YB��y�|`j�=�r��i�������)�,8K���Hc����P���u��i���q2�����[
$[�c;����l�������Vf��q�������j�-����.������(�l����:��~	8	��1�����J������nf�&#�AV�0���=7���=�i�R#�x��K��g���e��^f~^Tr:�I`�'JD����?�(&r�}A�r*�G���e� ����{|R�PyX�J%8�sR�����vn"��^P���DRB�>�=�F����i�>3P+�A<���#�}�YO
�6=�c�/��uR�)��RH���[=y���I&8kn��e�2��������`r�s��$s�~�V�,j,;7��SMO~C��RS��u'����D��e������>�����7���U>w^G�Vc�R$���9H�����6�{���[��/�$�����.pR���+1QaD�v5$) �������u9�Qx:q��C�s�F�bj�j�i ���j2pI�}37���z��*Hk��k�����0%U��2���B`+G�X�,V����i��%;�����voz�a
4E014���
�� ���nM�Gf//���b�YJ�a��#	�rg�#�%�������"`T�z�����)���_8%(����%�����������TTL�{�5�kz]��[�����Z���Q���<��X{�d?2�z�qiE��Q8�-����XC9.,T��#R@
���+�y�cR�s�l�-��'`+�K�d'�������8�T��F���*�<.
6h�����'Xw�y�c�(�T�6�b�<�p��^a<�W(S����cz�k�y�_�����?A����E^O��1�#H#	}[�%�o��C�I#T��H��M.dM}�����#G������s�8�E���������y��������
���|��=��]�9��O.����������0���=�\N�s��[;\jVE�@�|��H�pi�%h�0�������@f^F�B{�����#Z���+
t����K��I�2Wn�4���vW��/:#;����O*�Hc�q��|�V������Zc9�K���L�O1&� K�CIX?��<�����HyL��������2��Qc�RRg�m�`I6;�H�y<U;A�"C� �[H����C�`M��\R;X�%�wjk3=���j�H;������Vp�z��Y��4�f0�cM��q��)�$d�0�v��������s�~HqP�}��������d�!���t;7����R�<?;za�_-�J;����������++H�5�*����$�����[RR9�Ab�E8Q�(�F�2��{��*{����-��@�Dz�_�
N"���f�9s	UuS-k������H>�Gr��|�%%+������$��HG�b������/�>�9��k&����}�)PDH4�	����j��e�;="���e�`���(�Z'��� 
����Sj�v����(��'���$�j�#��f3)�L1rgP������!�EAA�Y�`�D\c_��mH��O�@���
iI�z=_V���4��H�s>
�
���:V��H�s
~N8�>�uv���u���B�|"���v��T�O���������h�2fG�g�<?��^\�9C�|I�`5JZ���������(��$J���Qj�%���^�/��-�Q$.c�v��|�0��|��T��V�m!����g����~�-��@/"S�GE_Y��
i��U�_�F�D�<K3����
���*���Net�S��v�����y� ;!�f��t���Y"T;XA�
>���J�ziw���E���2c+{+@	�WM�ND��>p0���[s�L%��K)���3,�!��4�P>�0�������W=;���lPT���:#�����H����\�v~��AV��][t�2���KY��U�
T/�����(��FnJ?UyX����D�����)j|p0{y�8�Z��[�Zy��J�Q����r�������<S�,�#�� �����,��v���1����5r]S�K��1�����B����;��|2�>��dk	U��5%s���������uM��9��0a���f������`HT���oD���$���W������.D*(���Z;\*��|L�z�V��J�Q����K����P8B�msS�����!�'���czvXBoN�����$1��������WS���&��
�y�UMK�F?C!JA_�����<�������!7���%�n|;_�MP)X�0�U��$F���y	�������?�iglU%�}������s@�����=�V��o�V�*��c�+��6�.=�I-_zSB*�K�Y��{�j�P��IVR��j�'���~	o:)V�C���U�K��p+���w�����7M�Gm>
Q;C��4.:���=k�����V�q$���}��w�*��gs��k�G��s��pli�4���9���]F�Q�09qg
geoH;�����{c�`���������v0�St9G�8:G�r����G%����W�`��
��(�VS�KH�I�1R�,Y�x������5�������@������]Q~1����>"r�q�mZtU4�^�-��w-	A��Z�^:�0�6�������_^w��P�d%
��]�v����'��-M�'�v3H���M���WV���'r������ O��L&|
��`6��E����[;,!'��tL�R��P8yh�t5u+���%��	���i�i\��^������S�}��n�����������k&���L��������^�<���mu98fso����
�6���7��r����R�:�tk�Q�PUz����$.h�m7�����E�wz������d��1�G�r�k�AB�Rf]/!8�5�#����2W���s��������]��ck���>!�zC}X#O5?c�Z0T������]mb{o@�2}����{������_g�`)D�5�m�;?��;���V������tv0���?1�f�T����Sof���
���!�Gy]>�G_F��6u�El�0�����B�o�-y�V���$�)��U%B
�1�r!x���|�E�,DL�����_Td2�?g��(�y�k^GW�W��d�8��S�}++HL.����f�`I���3�Giw0�$������no%�#�KK�	���pr��'w���K��I�/�T5
�Le$�m]xzu����n�� e��v�$;�z}��(l�7�WM��o��'w�H��\lMb������\$|�+��;��R����7H0��G�r��@gC.��v+Ha�����T�`���,��3��v���X�0t�|�f�T�L��P�vXBNb}���`�$a�v
�a��:/:��Tg30F�tC�����S�s�I3"#��Q[��
4�F_�{�G	E���GY���#�Rb�I�������::0"�s��X�(.�?�{rzm�
E��<��B�%T��������@A3}S��N�p�1EY��NU;\�g����j�y�#
�>�Q���VS9,!8'��tH�Qk�U�E���N;���������'yF?���URN�~�����F������'�%��TZ���D�Eg~�r9���!v�_,�<�y^��!
��8����M|��=��]D/�p�a�)ze�`�P=(A�8S���.�������������\r��Pw�KC��qfj
>��;,�:/�6G	j:��+��4!qXr����
��qT0l�������m�K
�����5��s�<1iANjH7�����9�����,��W�t�������&�7���XT~��S.#}��c��t��������I�t^.TF�8EU��3=�v������%L����j.a�4��;O�j�Q"t*��i4��bpRS�#j�9�(�-E���JK=b�9�q��
`
k�J�_X��zj�7�t�^8�������E�`j��(E�h��j�C-a��ss�b�����<T ������_H<.F���o5�� %�r�S������\JK��	������Y�\�'�v�1�����t��f�(H��Ko���a	+8��1����Ll���3��q�[aN@��
4$u��,:-��"d,FL�F�t���7s�	9�(��f�B�jmI���9��#�)Y)����_j��:H�����F�L2�� etr���E��$��+k��Q�� i0������`�H5Qr�v�	���l.�C���������J R�T��$�ja�R��KV��)��x�}�e4�`M:��|!�^�T y,�r�B���4�5�6�5=no��=[R4��:p)��@�iAT�3�7�eeoH��� Z���[D��$	�U�0v����F���Xg{���j19Y�%��#���<bb�<���P��, �d��A�\p���/I<�����Dx��V�]���/��
��o5Un��,�39�Y�>����-"�Fsn��B�����������	P��=v^��V��wY��t�?�V��{:���t��$�X{������T�(K�
�^��f�^��]������a	8��!U@�H3j�	*s���z������%1�P��uz��S�_��rC-��KK���I��M%�_�/n��K��~��kF��qL+�0�p:`�z����d�{+�j+P)%>t�^T�f�������R�a6J!;Pa��C�GV� ����3���|�a7�H�,�����X�
N��xH�����{��85�jO>��,��4�`LZrM�z=7���!0����85�������>�2ae<������vn�U�\h�H����e�wsa0$i���/���i�6v.���<����Jdj��o���J�X"R����S��/*����P9\zT��	��Z��Og��a������(���O}7E��zs�;�Cr��L�B=���I��:�^<BB@_�W��j�E��3�_R��RU��u�j|�1�6�u��Z��K���fj�o��������6�#��W���������
Q��T���He�`��! �A3��C����P)f�X|������.=��;!n;�j�C�K�!CgC��a��p�����3���)���t��C��Z���L���'
��q�s*B�xT�}������e�����l)W-:���b���U����vj�9��7!
���~Q2��0��j�1%�Y�_;A�N�g���nn5B���r�,��� 1���4b�'���V�LOj)w�)�F��J!P��������"RRZ�(t<5��.��I������`�H>����}�������6����!�q����Ig��@��|�>���)a������P�H�C�t����+cV�����.v�HTf���0=u'HTf���l��U/�� ��S�Gt�kTV����)F�;H��$��'`��;G�rXA���GG�4L
�U��)2b�^M���L%
3�=�2���	|BG�3��q�����i1����W���)�\���X���G2����5��X��������;����
E�.���NH��$p��������2�N����
P��2�^����
G������a	8���1�> ����zI������d�#�������^�s�# ��[�=���i42�^�^���.;��o����
Gj��l���}F`|S ������s
�l �L�%|o7�������|Z���T;A����k�B��FI�&)	�������(�/��1��p��8���^/�Zi�6��^M��Ab����d��cM[�)�������m��z�e��q�[�:�����le�E\`j�'A��Ha�L��d�^	0;�g�+�E��P~w0<p�(k`_�����D�t�4�� IU�������$K!�V��'��
G�e'��T�vXBN�|tLW��R�r�������s�I�H&=e�r\tK0_u�=����p-�K�n����>��-��P��2%�)��#�0����"�`b��������)9�A[�t������%V��>D�� ���coJc�`�$��{7�F���u��������j�Hb���;sj�%d�$�GG��2�^�'-���;\Df�
ZR%��V���	\S�9?d.>��Cc�iXD��R��� �N��k*2�F�1	t=�IM��KWd>�z�~���^������>:#��5�O/A�W��r��
8}v�6��M?;;\*;����4����P�|� ;\���/TV$��PwD�05>/4�Sp������a	�9i�!	E8U����C�����~b����������"\��'fs�m7�Nb[�m��|��9� ����zy�|��b���9���b+9b���������w��������7��������������v������v����w���|��Oo����`s�
����;�O�O�����/���=�/��/������68�p��=m����3�����_�������������l������_�������S����?����?��t�g~�{����������}B�7���f�)?��ww}�����|�xCKO�m���;o/?1�#��{?���������q�w���w��G�'��t�5������<K{^�b(
�/�{������X��3��?}�W!����#���:/�B�}��eA�����;���w����������gA���������|s|s|s���q�c�E(���w�����K}��R�?}(~����
�$m����*����b����.���B=;\�������6�������2D��b��/
T:�Vw�'@��������2����{:K�+��O)���!�P
��G�aM����0��R�l^j�imu�&�s����2dZ"�K�$n&��*x��=.�uQ�8V_��r;>qA�9��*T^���3/�Q�������k�����g� ��:��x��Fp�mq� ���^XL��v��|ag{�*����m�w{���i�N���~8�Y�R�O����J`M��Q��R1Q���Hf��T�F�pb�e�K��)��c��/u�$�s�=T��'?� �eMm���4BLs�x)����F��Bp!�*l�=����yr!2w���������s�v{+@Z�XTg����
P�.�lu���no����`:v�T�+(�)��Pnx�"Ah�2"7�)��a3�!@�1"�H�o�k��b���1��M �������c�XB+�U��D��v�|��1		�C��b
�u�,�:�b*{C@�����9@��F�R.�������������F�S@�V�0:�nc����[R��RK^�	�������^<�]�m�P4�^�������,xY��*$��������,�3����l�_R�w3�T#)��c[�����	���1"l�$CO���$�=��	���������V����1Qk?A�;A�����ap�`�h���������`)iM8�NA���RtA���6��
@��#�������~��E�u��C�t�`k������.���#R�!���p�����	9f*9�����H]h��)j�E�VT������ ������]'w{+@��a&YH:�V�|�N�����W9XAG���<At��(8�D����f��gMS����`���������0BdD!�\C,�E�f���H��4����_R��$���)���Y���h����x��D>��+��5����
@+	�y������X�B`V=�D����v0�$����S�x7�!�_��/}'���y�oTb�T;���j�������`�lg�I���������P�y,��D#��t�%��W�HE,9��ulnCK2�����N���OYc����5`b�C���A8@n]�����FRpj�G`D����m����`@0����+�G��R�$Q�P�����a�?vCE_`���L��.5i.����K�'��;\jI����{;F����C��Z*���w�y��M �������������w��n��"���Pkq�5���&��\b�$��o�������	�������f�y��}Mm����9�B)?�x@�����	��p�,�#C6V���l��y
��VhZ��R|v�k�����_�ahh���<%v�W��Sv�b����/
T�/Q����*�K��K�MOe��@�����������
����!?p!{�3��7X��h~�2#���&-�k�@��gB�f&�RFzK�U��s�e���*d]V:?RbM&NC]���E�?�]�e� PF^>a����� �f��{�o��6�H�)�:��L���h������\[;B����.&�
h��!����*}<��_B�iA�n�5���!QX��8��gf,�%���#��[5�.03�1#�$�XZEti���|j�<e�WY��{;7���J~�2v
��[����Y	D���}�� �Ka�L5�����r�_�3s�7!���p���`�7��;�{�;��no6��mm�;Bg�%d���i,�R^P��8�R���/	������G������0T�W\XS%:��BU������ ��d`~�	��P�V��d`	��9�1
�(+(�Pa������`f$����w���%Y�u6�:�[B��������H.i���:����Z�I�"��{����+���/��O��p��tu�Nz�U��L�[���N �j���*Q=X���g���4�� ����������1��3cJ~�y��z
4'f���6������^
H��Z���;�~w����B3�7i�v��*��dj�m*��$��%�n�Q;XA�d5	�������K�dz����;,�'��pH�Ob"�K	CkH[Q���������7�P2D�f�nS4bQ��|�+]�)��k�DfIK��)���DGA&�`Rnp�d�o�F�30��-E�f��s���C�����+v+H�����+�S9XA	�F�G��^\�f����j+H1�r���D�r��$����{[H�K�IT/c�-Q���6���9�Nf��$�z
����.�1����"B
F�1������<�/�G��5yh����|�_�KH�%��_w;)c-	MH,�^O����$��{��Yh��Y4�q��w
^�[�QvA��z�$v�1�,�"d��L������O���Q9,!'q�pH�0�������#	�eQ���q��v$���,�����rj���$�~�>�!M��x���HU��)���Mu����c����f�v�_��HA���$(�����JH���Nzev�y���q�������0qk�xztrv�4����!������PI]n�B�Pw�y�c�����e��>�;;,�8'��pH?�$a8)'�����GWS�U�	�}S�#��6m�E�3�}"�o �H�Yd��#�L�!h����J�[eEN���LT"K�14�������#�_�|�:���y9��������&L	��V�7�k7_�W��j�B����N^Y9\�&����I�*��C%�|�(�����P=9��;+y*�PGb /��|$���{
��������!�M��gQSX�Rq�>��q��I��@G��|���U���9���@sL�t���������)�c��~Q9����W]��}��l�����/�.#n�S�Soc�����|����.�� �eJ8�Z����[����u������$blE�O����K���sN��K��I1�#j�g����s�?�����-����I>H�����f�K�_O��6M��5�+^M��J�)�E����NC������S��9HD���>��<�i[H��"��3=d�7DA;PQ_����jt2:�z�yv{3@�1�r+�z
���
GC����������4�<���W08���!�?_0F�ZS9����K������/C��%,�]S�s=5�%	���2��%�kT�����E~��Q����ss����C�[Bd^wt[]�,;�!����� N��e��3�p���eE������9���=�[
�
"w��V�fs(9�[�7����%$�$�G�"e	%����#�����8���l���
���k�-�F_���n�����g���u~�LT0����6�Y��9���C� �C����c~p`{#�}�.#����;AR�\�A�`g.^�`5JY�{B��;��++H��r��Uv���`6J>:y�u�t��VsIu��P�pK���������PP�B��#�
�^��z	���^���cj3���5y�S�s �)�1Q��u$�&�s
=#'JE���H�-!��F�j>G�$������u�]_���z�����no(�0�B1vv\���1��}o����
�� ���'���V�"kqL������e���&��p���C�	#$N�3��5uZ��u�VAI�T�i��z]��|MU+��b-j�%a����)9O���`���S�)�%sD��dy![�g�b�58���lrX;XA��5i@����;XA�N��;�k3H��6P��tv0{q$�\B����� ]r�}q��:pR��#j����7s�6,$t
�A��nG*	1�[��~��l`
>htNJ+>j%��I���K��=��L���N��Y�A�������R�x�b�@p!�& ���0�t4��m���u�r�,T������{H��J��]-��/
417�5=-�}��:$a!�*:��Wlc�_BjN��0,Z��D���K��CK,�����8�-��d�����CX��y=5�%���!��0�)��
��f
�&�AQ�5��5�`n'G?���2�5�cl�@Z3/��������U�`�B�t��-����ms@N��o�������/�'=��N:��/4Y-<�V���@w����%`dO������'=��k��~	c:);�!]��e��9�5��k$�����U�/EDa}�3f[��������g��-I��1��)�%��i��P���Y#"1�Sd��"�(@���	]FCBB�(�8��������D��W8����z��R�l�V���������	U�F�H���Cu����j�b�D*Z=�����~	8�����}��lBL�#��+�uE#����d����,�oT�|M;��)���$����c�BT�
���s.%��c��~�6��������!-�!2����+�V��P���R��Teo��|Y�:�*{+@��;:u���f��e�dB7�{+@Qu�C����i����:��������� a��=�P�D�K������T�)Au�:�E�k����~V}F����c�-y�=�������4����e[�5J�S�9��>hG�����r]_�"����C����]/w{+@�S!������F��|Y����7TF��n�S@�V{nQ�]�{S�[�dm��z����KH�IZJ�%��X�Ua�reS��/�H	>6��[$`�6t=7�$�#i��#A���5������B��k�ncM���)��Q��H����� ��e%mo�k����9�mR}�`)H,��
�E�I���`I�FK��[z�;�AR��|g�F�`	�i���T���l.at ;j+���\:;� ���)�E�MiWw��S�5�@W��o����C=eU�\�ox=����s�0��c�\s*0?���`����q��9�)�����8rE������c�	$��c�����[!��Z����-�U����T�V#�����/�������W@9`'���j��&��-��'#t�_�N{���^�R<�L4��5����B.�(����l�>nY,�j���"!�J2E���T�_N��eS���HS6�����k���F�s���`O�������>��H\\l�=
��V��"��U*�����`	�c;�o�,1�� m�����j3H�c�������J�q��J�K��I_/�����O��%��5���FGb
���=E=�KX�kt}��e�j`�G���`�f~��A����?x �&mpr��L��W5���#���M8$	����XC3.�/<Y�
�����������)LD.���%U;\��N�;��k��CM��Wi�	=�zv�4T_�e.�yS;�C���T����N����sR����*"K.�g0���)���� ;0��a��fX���zn�S�r����NZS.1?���R������*dMn�q�[j��9�D���v�0y�yO��������J��n��e��h�idt>~�P�.������y�w���/��
�:����a]�_hV=yl�<��/=�A+���t��������U {�^rD�����^���~	+<IK�ai��j�l���0z���4�0|=
#�L��4N�h+���e���^/�A9��������"����1]���0�v~����A�26���*���E`N��@������$���|�}<�� ���,p��g���+��&w�����F�x+���V_�[����z��9Bg�%d�$�	2z��e9�4r5Vp\��RPFM��8��C�Ed`f���,n�+IH0�
|M�����IBQ�mk��&��l���'��s�9�b/1�WF�������*������
P|��R���
�W����<�fp�lYk;S�v{#@��ca��[oeo6�d[J9�v���p���`<�S�i2	�F��A6�L�2�)�eL[���S��	�[)��Y��=������l��/B����C��F�|
~faFa���F��T�����!������
e��S�	����;�<��`o%�@�N����
�	{��*{+@[#>a��x��l�'y�xH1&��X�H�u��W��o9�D!��F(�����]t0��h�5�D�",��	��9��� e�����E������A�4��)����]�����6 �d���R0{ry~��t����ig{����SB�<����^Y"�:w���
P�!���>t��T�}goF��~		8�
�C����~����[��"�����Z��9����&�S���f?#{�k�J�i�Bk6�����2���}n%K�j~��L����'$�^8�x%�� ����s������H<�T��f#�$j��S�������������hvUx����
N�N��>Z��/!'E�8�(��7%���P����&��(`�����[�E�����%�2�����(`�_�6�k:��_���m�q���>��C��3�R���q�W~`��|��s1��`){���v6��� ���T������`6J�]�)w���f�����3U�v�������{<Q;�`�$'�IiR�LSJy(�+;Zs0��D�1�(��X��"m�)�2���dwgn�#=����M�$�8��O�g������_H>E��Bfcr/���R�x����#��>���0G��pP���:����R�.k��	t��4P���ztv�KM��R����|��(����He�Y��Js�FL��)�$[R�����(���b��������L
~/����!4�Lo����I5s�@#������t�v~�r,>Qn}����M�GF3/��������U� ��e�3���_�g�H�����������0P,�<�2T����#��<s����=�_h�XQS^�he��@��G(r�}�w���tR�L��GQ��"D��5�,�(����9�-=��X����)�,|��5�~=r���������}�8J(����j{;_v��D�Z�>[�8�[E�2R��$'�Z��E�����3������ �2O���1�����=v�T�V�H���4�����eT!��v�_�Nz���$�T�6��{#�;,j�qu��j!}��*6;���R����!cI��8����j�>�1H||�[4�dpMy���IV��$r���?�u��^H("�w)s*����� �9�����Pq�6�Ny���
P��4d	:�����
�z�7���dK%�L��������^:�!��'Y��Sk8�M1�HE��e����$�R��\�Vf���js�)A��}����Lu�n�����T|l�_��n��g,\�����b�����V�,c�9�:�=Y-��V������j��:�[bv�B�7!s�7��|�~@g{+@�].1u�iU�����,����l������!�Emm��Pl;X����8���E��i�f����^
0�0`DY�<���Z�k�E���>�����\Df��T��2���[�I����	�6%��������
P��#��>���pt�T�f����������PP�,Y=[Y�����F(��o���Y�/!'��tHsP�jc�V�HSraQ~�a��cu�H����������|m��IH!��6T�����_�*92{�cI.����I��t��:������
�\�:��T�f�Phu
�[Jeo8B�C���nn�;-#���`oU4'��y��[���"������pR�K���J�����A�����b
��'2z!�T#S�����O#rQ�����EwS��@W1����k�Jn��o`9C� ���)��f'��[E��,���� Q}"��������:��+{+@��@9��U�:�[JQc�Y�Z�����if�?���������)�e�!z-���u����Ws�Up�c1cK��u�F3j
~.�4����0��E5�S��aP�T4Qz����{��9�)��������!z���;���%�p��D�1��{G���E��^���w�Kyp��wO���/>�$��%����K�,�����W��:�!��o����~	�9��Cb�)hC�\
����FZr�~u�v�Uj�����mk�#x=7���Gn#u�[v������O�=�<�f'8�c�)�������K�x�E��i��

��N���et^�������NQ|d�������_�SH����qg�ea��@XM��Zei����*��	�l�{�h������x�^QBg�ne����I��U��e�6����E�4������v[�;P��r^/�tr���A.l���`0�9z���"��xD\C���6=���V}h�����2�����������7����{�����e;���veo���M����j�wI�^��Ieo5B����w���z		8	�a��b�4bP�yM������mT�Qf>�H#l�������/���Ha��'-����}���'��� ��%�d�sScF�H<�66;|�+#Fie� -���T*{#@I/Od.�-����.��>VR�[
��,1l����F(9m����S��� ��K+���z�_�N�}xLr�9��^;�
��XS�z�v������L`�*����v"�������y��F��O������~I��9����f��K���J��#9��m�}�X��V��lIt�dT�V�Kp�jX������j�^@�V�]I�E���y07�AY���B�;�/��OB}xH�OVqM���������/���/���;!��
�V&���s�cN������(`�|���Xx0�s
���C/A��I~<S�x����<�N�:+�*{#@��Kz,�wP�[�����a��R�[�+��#�����rnuD��no8:%7;P?7<�����$�����|b�^&�P'�������5��2�3-����k�UL�~	A���0����I��9����C��Zzp�������,<����/v���+��( ��JR)������U�@()�V����lo�K�Rx�lo���N�f�Ge8{�Nvs�-t}l��OO�����,����	el�A�&���8��.:G�b�,��]��S�_yO�L��~�_S�1	?�Lc����`m������i��o����^�$��E�Q{�&��,�� ������@�v0% ��k�=Jg�QpD>�>���tv��:5(ag�@�`6J�]����a	8	��!�>]������%qI@}5_�$����~�y�}m�D�#�$m5hmi#-:�}�A/��g���������%�fZ�����n���&^�0��"���7���<�u� �(���_�������3����8����n�|����������wf���@G�)d�E	�d�����Kh�Isi&`����:4jl�i���q��������l��TvX��03���GQ�������"Z3�Xe ���h
�������������m��=��yy�������?�JQ�.����,����ms@*���V#�o����#���w��/>��yRY��}��0P�Q���f�T����A�	�����'#z�_��N*�xL#S��z�r���_S�����7��B�>�4����5J^��� �u�bb�<�����f��HZ�v)��0���
r!�
�c����N��L���O�������D2-H�|:�����2�V�o��V#��%r����Fb*�d��7�A�rg;��|	8	�aC��}C&/�y,Kn������(��|l]87��:;���}�X8�eX�E}�o�?yH)�Be�S	/j{;7��R�����X�+#V-�R	�"So�����Iz��U�f�H�O���������OZX����
��l'��r��� ���O����-��~���
+�m�i�L��|�!�"	L��FM�/�����Bi��4����\@�;}����k�]o��'�1q�,�`0�uM������1�z�`���2��
Xmq�	�sG���Qt�Z>�	�lo�gO���
�@��,����]d�E��'*{+@��>����:�/!'=:&��Z�@������^0��lD��A~h/��/�'�GH����h�I	����/_�J0��-9��5wS���x�Z	[�����l�t���{{OV�F���gw�eo(��J�v^6��V#�6E���^�[�!�:jnZO����j�b�� v���|	8i���<��E�^�sS����]���_7��������E�QS���\�������7��oM��#�����kj^o��dW)xl�[��>}GU"?�h�D������+{��er!s_�eo(����P��*N�V������ ��#������
P����;�T*�%$�$�G��� �~�/�U~kH_�������t3��U���.`=0���p��Z��9�A/bX���p��0_�Ij
>�T(yV���2i~�N,�#8":N)vv#:��I�N���"��������DG��V��'t��!=����>�T�a<�;�1*{�b�",�ySR�/��O�ytH;Rv�H�����US��`}Iz|�z�����m��Rs����)��2���(�0��m����F������fF������l4����|�\�D���Z^ Z��w�>#���&���S�������Iee�/h�5=M*{��8P!A���w�K
��}gJea�z$�3���sG���@���0�
N��;e�+�%\�$H���������Y��Rj^Dv��g��3��4�wZ/<��d����u���`�����)����@����a�]�)����aHi(k�]f;?�/�>�{�����o?������/����������o>�����7o���|������������>��'���y>}~�E���g����>�����f��a��J���~���O�����.u���������������=�?=;Zj�����_��������7w����7��rf���~}����>}}���/�?���'����,�����6���_>���[�F]�����<?u~y�o����+���?�W-����_�~�*�&�O���?�������������1��|�����O�����|��7}�����mO�����/g�w���������F�O�n���N��`�02_>�*_�����V��������?������)�����_����|�o�����?�x��:�4������w���#P��:��������������O�����������O���*�/O���<���_d������o?��_~��Oo�C��_?������_?���w�v����������&�}�������������a��t��������������uO������������}�����<��w���"o��K� ������>q���&�3?�~n4��������������I���_G{��~�������C�>���}������?������eY������������G�=���<�����k
aKn�O����w����,v���1z�g�������{���������o~����O�%L����<?���p
�'?�	E{
����_�~�{g��|y��/���|&��3�(�k�N������8^������������?��������q��y������>�}����Se��g��
��ce>�Q�x�Q����_�~�����~��yF�?a`���E�.���"��:!d�����+y�/�Q������^���	�e�n��]�5���	�B�}�CE��88+�VG��������j�8�l�b����[��oO�
G��|�9��JU=�� �v�RU�f��X�p� ]�Q5���P20�v�;�'@`�����H��U6�X<���K����T���	�B4}�@�p��
�����	���T���=!������PQk��jv����c3����p��	�lNVy@�]�n��H�d�I� �D�+k��l�p���^�<!mUvOH�#g����S.�+k�m��B���E8���5��6��'<`>��'�7-O�%G��r!C�z�I����!�5�g�o:�,q�a4]��=
�a|�5�^����I%�o�	t��	n�`�~}���>8�Ih�P��e�f�f���CB��\}��g8H�}��a�[$X=a/C�{B	v�>������TU�$���{�������
x�k�����	A�U�`���	1�����	��T��������o�(�1���i2~BRQ�<T�[=����Y|�O �Z7���>���P;G�NA�	2��]��O�D�M�,��!��_�T�K��'D�)�}�n�
�F���	��d�;,}B!��e$,q%�����K�0�Jz�J�Q�<A��Mv��	������7��;*�8P`��Z��2|B&�P����]2b�@��\��$�����	�������!?�$���{���.�E�&8�b�5�����
i��1��A)e`�8�$����n��	�ox�O��dH)�	�>��i�
4�J��/��/zoh�����q�	L1e�E��fh��l�P���{B�@�G|��$���	����e����{H2Jh�	�$P,`���O�>���'���u�.��g	�V><fknO���	�'���Q�i�������=!���>��a�|���M�\�E��~y����X�;��5|��������h�/���&^Z��h����J��#�4��X����Q"��"�k,}B&*vI|��6��d��I������D����f!�>��]�E��)"$�mZ��v�R�Q���']$>��k{��m�#9�^����������;C��� �t���w$�ub �g�O|�����\x��a�n������u��z�p�Vx�O����aI��3DH*�����u�`�����@��\��t�e�O��!*87$S�1u"t\���G�#�u��6�..n���4��,5����.?����}�;M�����-�E���Aw�������6;�~u��m�Z�������l"��nR�-X����T������*�KR��P�������Iw������tF��#�B.
"������|���B�:�VHHE��k�j��@�$Y���OL+CV ��pw�C�N�.s���u=hr+�>T�0����CB���!w�0�/ �\���}�aw��[�D�H�G7����;l�ahZu����S���DZ���E�1K��@�*R����.[7�]d51aI�`'����7��b�v	����R�?���w����K�Y> 4b��]�?�"b��ir�U���{����y8���V�;H�d{�}��\f)d�"��
YM����,��qe��Z���U)���a���<�,�'Bl������Y�������:�������*������*�����:������<;��t_�����6h0��r�����_re t�yu��o�A�u+kc��B�W1�i��|�G��[2:f'}�/�>zM�����:u��1y{��Z��3�:�f�������=��u���p��6�Y�q���%���)�.���0���&Y���C�L�@B�.���$��F���wZG�C�Z�����'tEJ"d>a"M�J\i���C���=yq����4yS.R !��4�P0�����d�������!��+'2N�
[��I����-����=3��.xk��4���M`!	�
v����1�e���~>�LU��o���S�����W7:����-�=�G��K�G�3 �
���4�&wt�d�\����`��2C�x%,��$������qq��	�,j�t�N8�oCT7
B�����J"�l��~��t
���������a��������emFh����B7=��v\��BX������n�T��ID"���Mi {���p�i��ep,;�%AA��J?�utK�'�!T������wu��r�uK���	�����W�-��]j���!{]���{>�������m��=�x��=c��)�5��9��+�{�n�o�c��u��:�M��.C�q�o(vjt�]��Nu�N`z����D�����u�vMW�Y�sZ7&��g��UM�`�*u���D�up�E���������c0����,l����I{���$����)m�2K��,B�����a��La����.��B��c��A�������u��	���X�t>��GGof�`�V�6��y�����1�����e���8*[%�������
{�O��zG�����h��4�!8��t�-B���������<�W���|�e-c�W��t+�<*W�&��)��c��X�����`�n��@����.
s<I�C��^dW�c kB=�eB�
UI��E�&D�����Y�_8���n'7�bt�@��!+����l����t����������e�s=0���p�q��������CI2B���]��pX���6�t�3=����.����I?��Z�0�X?yU]z�!��.�"B�]]�'���l��(���� ��(\�V���PYJ��8��� 9b�9 ����QW��)����
W��Y�Y����"����.`�8��Y%�B�s3������t���>i��1S �"t�@�2�p?�(�����M><G	��~����~?�����@���(@h������`�AHr�*H���g�������C�WF���`O�ub����c4#O�'�-�p�F.E7���S������H��0��Jt���O�d�R,��u&c�~���%�f+�����u3o��������Oy0{�� �X�x��'��������}:��/��e� ���D�{M�
���:��C�Va������>�������1����N~J�H�<A��A�C���S�0*n�]�3|�������|����e���d����u�������������.+�gL������G���!���l����<]����RYH��"fux�%���
'@�P+��g����HC��=��L+�Y���2����]�������(p'^���&S>A���K��M�Vy��h4���]�z�!�K��-~|d[�@���������n����;��=�n{����_�0�Y�@	t�3�!O��D�i��|\�y������X�����*���;���e'�$�����4*�g��e?�A<��o#{���S���z�w$���`W�Y*������>����q��l����������D�@�(�B�.�J���-�dz����t��0L�t*^�����t�3)x��k0�O�B"H��!�~��q�uf)��z:D 0�@
'E��-r��w�p�#*��t��a�������y�kU~��1:U�Vv���������+�-���#����.�O�v���>���Dj��
��%����t�V�2���u^��F!'Yt{`�\�J�{��������t�7��i���@p,��uX
�Yb��
L���t�	���<�k�^�n�����PCw�I��pLx��i��na���m]�d�u������'���q>|�&�����+9�������/�����u	������`_e�>�'��|@�O�T��E�q�-U��3@��0L�����Hxe���������^C��9'�u��@H���~%�,���o��8�e-��p�e�cp���z1~���:�u����i���k��tF���p<��V��e?��}�����l��o�o��MG��!���:� 4��Wx��|��Z�����![���KGR.b��Uf��1\r�R��Y{N��'�n����
�a�
T�6����5����U2� ���."���Xw�����n��8��u�"��������iG�i��$<*U>��[�a����M>G]�����3�:��7RnZo���'\&B�_��LRW�$BF����g�uJ��s8�hR��� 	+��d7I�y;:� �f�42��$r�97��N���G]�����]]�����utZBv�.o��}{�� ��]u���<%�QER0��b y{),D{������w���V���������*��VN>����Q��[�E�!��e����B�L����i��/\����=w����^6���Z6��TF�z��{n<��j1�������tn�����A���I��R$C'u�;�@yt��H������<��-�6(�q�M�R�:d��8"u��@���A"����%"l���������u�5{O�2]3�Bd/�e��@������2aA����Il%[�*���V����Nnv���,+e�z����:���]����`H�d��@��������O�	�q8;t��!�|)���rd��5
)B��hT��8hYT\�����oppiW��������:p���W,!�9���;e�����>�t��*�������0.����t��@�����D����	6fZh�������g?�m���Hx�U8������:p$�,����F���vt�p@�����O-.�����%i'��p����w"u���\Sc�R�����$;i�%I�e�XS��S�L{��������cN��	D'u�V�o�H��a�}d�>�P�������L��uO�:����L��LA�I�w<ke�o@�d�M��@:	SzJ��_�D��!��N�h��s ��� �!l����0�[7k���8%�,^� p��!����ak��U����C?V ����=�IVo���l��:�����7,t�d�n�����B60����R��p�>�
"���K��ln��g�$pv��SB���KF���$���C#���q�&�ev��n��)�&a�a� ��D�7��5b?^M�_&yl�9���!�k����l2{����������3��T�q�%�%��,��������9�.�������9Y+�!%���->V0N�]���J��O��y��_B������dG�{�:J!C���5����d���Qf�
��a��W������yJv�����������m'�XU�W����SY�/�����d#����57���y2���ak�^T�@b���/a�D�dk�9���2q ��h��p��������W��5�8�C0N���	��8�	�u�7\'6u]��:�]���tq��'�X���J*y� �]8�~��!�4�c�u`m��Ju��/(��B������s��"�<��7�������Iy���%��^�}X�9\�,�'���J]�q~L��6��Cx��7V�0a���	��raB���X'��2F
 |l�2��\��>]���j���$L��ua	n�F�>]Ez�>yW�)�E47M6���R\��XuI���OV �v�e�w]v��t�a@�c��+M��&t�f@�s���|Uq@p�Uw^T�M�`����R����7e�b g�l��n>���,��g�d!k]2
��;�����eS|@`�8+[��;���-|K�d�r�2�w�}`�e�d���9P�����=s��+�B��B6xU�^�p!������ ����lH���C�!	�IY�X�,�m:��z}aX}�Fi������R����-�#$�+[g&G���WV�)c�8���R
$���U}�Xyt�����EqC�R����M��!	m*�n�����
�[zu:w��^���m�7Dgbe]2�R������b��B�@�y�Z����"��'��`�6��S <��q-}f��0*�K��&��hl;�_R�E���3,���B�%�A&��r,������%t���umD ��:�����'L�2�)���L��^�"{���"$�uo��`��yY �A�����zj�}���$�S%�a��+;|G'6S6�
�{�����o���!�
�:���r���k*7���Q�q���9��u���]��(v�'MW�-$���y��a+u�u���a4���+�{�W:�>���*2�s�sa��)�,��>aa�C���,h��\2���"��G���ra�Tu��c�����[���_+��|�e�����<K�B���N�I������bKX�����k����|+���Y��DI�Y�I���Z;�F76F�����������p�9t�
�������}��@�d�u=��`5�	���;�x�.�'B�k]4��W]��T�{��
�M���
���M��}��1���� ���]]m	����D��>�4����L7����ORFBA��sy���l�&t�zskZ��
�:�u�B�p����L��G��1o��[F�xS�G���-�o�j�Atc�D��2R~V��r;�su����:n��F�a88�3�En)��U��d���4�,�dv��5S�%k%a'qdv����8�d��b�:����������n:�a���x7����U.�C�t�"��2��U��
��y
���h����@�9.��l����|1BTd��
Q����#���dN{#�qk�|e2�k ��sF6�����8JG�T����#Ku?�O�Pv����:��8��y��{�������0��&�2nn="�
YC��0��5
�����G]w���kd5���9eO����NZ�C_���X�a� \������0�~�����Pj]��_��m���u��2�i�tM�"����P�]*�!���{��|�6_�lU�0(�&\���f�����Bz2�?�xO��'�,&���Zn���)]j���
@����S�����v��?���:&����juM�����5d�2T�>/���>% P)^�w B5�t���Mf�6�5K]0���T��v���'2f9 ����z���]�8n�r/��X���Z�!k�V��-�0�W���2�����%�2��!���S���Q�.h
��.�	d�4TG�K��O�V&�Bv�C�3]������5�3$W7�t^��k2�N p�j]xH�*��b��G�������+��ao�F�@���uf�
�L�������q������C*�C�3��r7b�8��PD���Aw�����:���Q{[�D���J7K�!���������K?���������	(nwK����]���)!,�r��:��N2�W�.K$�U]���ri��F��$]LY������D�Wy�!�G:�;]N�����L9ISL���������x�g��FE��S���V���\��g�������J.����V��m�.�[G���[
��BL�B����C��1��>C�!�2� ,�
�����c����Z����2�Az��e����8��O7My��k��^�c�����)���h���1�I��not�@��������]�&s���zE������9�`�:%��sG����V�K��b[��+j4������D Y��1��5��+��?����������x�5����P�t[a���HPPBs��[�n��������k�0��	���^$���K��$
��^"��n@Q�W����oI'�
���*d9m�'�V�3�(�yuSEDh�W�5H��e�`����U��x���|�O�K	�������m]_�)���R7k��)�p�QbCT���.���A�����~��
k�(Y�CX��2���S��+�U���n�t��/��hl.�d5�yN����
�d ���K�"a�6���J�!p����p5qOpXFw���
�Mg��X��0�q���II
'����&��Pd���(Oq�
�p|B��\��\W�C'�V�����7#�!�o���n�A��eW�a����`�RO��~�B�+ ���	2�I�e�,�=k:y�f�}2�3������!p��,3@����7  V2�KN9>��L�E�v�Gp���e�?@@���+�s��LM�:?�A�V����:�������#E&u����e��������1��:�d ��I'~�"�a�PO��v���}B�G(����B�u:C'?�!,j���p����B�DH\4���"�C�&\��H��]��p�a�f�R���o)u��@X���z3C)�~��{�H�����<�������
I��C�1�0�'B��m0
���������QGH#���&�l��X"�2O�?���P����#���������v?0{�"�71K�\�A7����P$�0�\Y"�dT��}�������l�������.#>�<�c
b
�3��7���P�����~�R�O�R>*O�[����2�������eM] ��5t588�S�Rg[��?�{KC��{u���{��+Z���Zg��[L����6����@R�
����@��'s�K.
8Q�i���{u%�8Qd�2����u�[�aq�d�`@hN��L{�a�V�"���7����HC��t���OSRGZ%r��r8���$ns�6L��[B��!�>%C��;h
z���R�C�C�\Q��G�)q���:i,�ka(�A���eX�y�cQ:����Z&���@!�3�n�=�u%��o�e�d@���ZRk��G�.�}e�B=]���?���S���U~�����[��������N��C����I���2�e���k"��N��C(�<�����v�z^���r������U)LM���u
�]&p�����06��
�''����k�e�@�^�zU��C��J{g4V��>N]��K@���KY�{��#��� ���~�����%E@�-s�a�������o���|:����g���q�Z%�����@�X��H}�pO�0��l����b�{S�-��N����k���z�[��+��wX�Z?���*��zu�-�3}u�;@h���7 t�N��"+�����H�d]��k�ltb�@����]V(�����-#�3��0����|2O��p��4��7A�2����7�eKdY�����K}'O��Y$oH��d������A�-d^�q���dc,@ !e�*Kk����dd�k���]�a����Xf����=p�;O��[`�3�>V�:�C�l��h �{)3|�����%@����N�b
�w���:�N:{7]���Hc�u1%1�=]����\�2��u���2����RW��Z�
���J��U�D�8�E2
��
��-dh]��=����_��?�@����
v��Pw�h��L�s�9�����uU�u�o Y`G6��ZN�u[�;C�M|�2�������i���n�x�Q����_-����+��wp������[GD������|�e�=�j�.k�#��B�%���2/G���Z2�9x����&D��{���~�~S�-��ndR!���~�Y��y�����=�]Q})dx�2 C%��H��F���D�]7*�����M��+�5��e��(
)�JG��!�-	���=&[l�o�[X���:��tV	���+���d����KI2�[3aD�9�
�!��}�M8��������������p�0��B��
&G����"�:`\W�����r�����*?�����q���^�(�	��%���~ �|dX��U�N��r�C<`Z�6A	��ql���WW>!�P�SV>� #��_�!^�& �nI���}�%T�!p��:V���>��y�&k��@L����w	u��?a�J�������4�Wz��a��[$3#sYfD�>L����[�#�;Vse���p���l��h7u�.�����F�	��Q���������5�(*\]w;c���h���i��jB��
�!��%e�%?U���A��i6��@��_f��b����"!`ig�YI%i���/�R��)�����c} D!����(#O��<���f8�M>��z����_
��+����!���q��f�l����w��\�������;B
�<�!���N����ye;V��e��@�np'pd\�(��^��' X�����Un�$s������
���"�����!B�T���)�w���:�p(��������:!vc�^��Ay<��D#�i����}Z�,(c���- �g���7"��Ha�j���'|��$����w�;2y{B�2�t�z@�N[��nL��Na�H�+����$haL�F��Q����h
o�3���#}F6"�A�^��Bls9��J�py�"��6�<�Q>���
�����:��G����@X$r���)�i>��0Q����Is���s�+��������-���?X��8qqe%ls��4|+MT�L5��Q!
�p����5�\,!�8bA�T!�F�	C���^��L#���P�h�13/��|8�Fx$�x8	�{�9��!�"�$�dK�?��,�Fn���i	��~����	aj��k*��������A�o���Y���K���� �w����s��	k���i����$������{2q���8��FX&�w��n]�y�<$���
���$iU�N���O�M��a{�.CzH*�|e[������dOa��gu��2��(�c��=�
��E��	g��TM�]��n��+��U��u*f<�{6KGI�}J�?�uc|H��c�������bS�Y�*��.���<v�hum�R����n����&�~�=��2�h'B��b�E��2�@ �m\m��0 H����x�
p��	��j{�:#�-�p��"w;���grW�E�EM��
��+�V�d��R�Q���
����D�W�^�t=I@0���
N����l9���f�������{Q��S<�2&�0��p0Y���a:�����Y�Ma�(�Ox/�Q;7\F��6jm"�BpL&�0�"���Ej�������`�x)���0��������W7���t������X�n?"�e��P�R9GV�����t�D�t��KBN��N��{hu�s�C0��g� ��t�G��D���� %&[��!%Rbd/�:���F7rI��H��9.��5��[�X3>d�J��,�^jJ
��N������
�w�q:�l�����������uK��,��]���"���D[do�-r��Du	���<��a�����u����p#,
���a��?p���w��5�/�.�0s�c�"L��6�]��������@��Q�/J��Dq����l�Mk�D<Nr��p��O�0�!�j�����)�p����|1���KG��lV0�p��4����n �4Bf�Y�D�Z��m� ����Vc*(B���'D�kH���yOa�<��N3��^�m���db�����V��S&��BF��e��a�t�c+�=�Y���H����8y_���XA������DR(eu�r@�^t?]��7��r�jy���;:��v�[�U��#GDJ���#]/�]�`��>�B�)��6B� �@(����3����U�L!���>�*"� �I��,�8���Y�i8�d�J����S`�^�\��1�k=J��isd���*@B5a�e��W��-��o�����������Ur��L�dq�*u�H"�W����:�_���bO���[@�����3������h�)�9�z_�����z�Q������}�eB;��)�t�*@4�R�U7���[,�+Z�E��Z!&�z)�^��vYr�g�$�,���P�q���X��E>!���Y�(!�������<����R�_U�j����H)��N3��>R��N|��$u����J��� ��m����������>(s?7't�* ��-��ZpQY���X?��:�S@�����XOC�B�����X!G�r����|2<h��.�X2���}��1�"G6�MV?~#d�a�CqY��zFD6*c�~���2�
�z<�7 ����/
AZ������`l���ua����{��/��K\���kf���������Y�f�lx����1N')-�z��n�,el
�`z�Y)B|d�R��U��s��0���,�!���-��J!�=de`@�\C.�|'���#<��m^� ��r��E���k�hu\���-�A��N�n���������-m�<�9��������['������$�-c #���ye4��XjW�5_j��/B��G�����1=�����h�����BD��zay��l��u�lNB��i�d��Hr���/�����5!�bC��`��=k��q90�����, ����bp��~!�p}M&�b\�9T�Z(�<A�
F�J,:����HYK���	a�+�Iq[��;�&4P|M�[�D9���0���C�i�'����+��"p��B��wBX9 �L�l��K�:.@�C�wC��pu��l;�";L65M���2�#���_�E!��M��!�P&S���3F�:=C���!��l�
��0a��|��")��P��fI��	 ��)Za���3����X�N��Z#�U��`\<��+�&������A������9��dH����i�i@����:3�$�5��5�@�3,��zB4>�����/s�����I6o��SM9G��z��ve29�`'��0�l�� �v�;):�#�H�T�*
!C�*;�d��N4��-u�s�b�l��6�����	����nn�<S���G*��f�s��m�F���_h��2�u��8�@�w��A������ R�'��k0��^7��!�X{	K,q&[Y����BZ�9�2��#&\��!��t�drj�^aH;�.E�*9n�Z�Y;���"C�0a`>K�(��qw�t<�b: ��w�_]���^�F�l�����py��#l�!%��E��GM9����<W9@6�X$|:w�igIj-|��t�+?��s�&��.^�	���E�|C��I��:��4�L�2tt
����S�8	!(�)?Q�:n0B������{�����[AnDdH�c���!�����:�i&"(k�<n�Yd�:ETBp��L�J���GF�HR��pr���a���o�bf�C[��Y��������w(���:��n�n*���8�V�%Nf�G8 �,�t�E�`�'�������`�u�����d��C
�*��R03x����if���u�mu�-���?���TM��M�p��#�RZRwQ������������������o�����������?�����/�������_�����~��?�����������~�����_�����/����_����o�������?�o�������������_��_�������~�����o����~������_��?������?�����/��?����K���?�����/�/����������	��������o�X��o����
���������������������������?�����/�������PK���BD�x#PKIq�P�l9�..mimetypePKIq�P(��p$p$TThumbnails/thumbnail.pngPKIq�P��,���2�$settings.xmlPKIq�P��h���)manifest.rdfPKIq�P+Configurations2/menubar/PKIq�P6+Configurations2/floater/PKIq�Pl+Configurations2/toolpanel/PKIq�P�+Configurations2/statusbar/PKIq�P�+Configurations2/accelerator/PKIq�P,Configurations2/popupmenu/PKIq�PN,Configurations2/images/Bitmaps/PKIq�P�,Configurations2/toolbar/PKIq�P�,Configurations2/progressbar/PKIq�Pr�'^��,meta.xmlPKIq�P�b��(
�.styles.xmlPKIq�P�u��1,�5META-INF/manifest.xmlPKIq�P���BD�x#>7content.xmlPKe�{
run-pgbench.shapplication/x-shDownload
#289Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#288)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

Hi,

I've pushed the first part of this patch series - I've reorganized it a
bit by moving the add_partial_path changes to the end. That way I've
been able to add a regression test demonstrating the impact of the change
on plans involving incremental sort nodes (which wouldn't be possible if
the add_partial_path change were committed first). I'll wait a bit before
pushing the two additional parts, so that if something fails we know
which bit caused it.

Hmmm, I see the buildfarm is not happy about it - a couple of animals
failed, but some succeeded. The failure seems like a simple difference
in explain output, but it's not clear why it would happen (and I've
run the tests many times but never seen this failure).

Investigating.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#290Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#289)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Hmmm, I see the buildfarm is not happy about it - a couple of animals
failed, but some succeeded. The failure seems like a simple difference
in explain output, but it's not clear why it would happen (and I've
run the tests many times but never seen this failure).

Did you ever use force_parallel_mode = regress?

regards, tom lane
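
To make the question concrete: a minimal sketch for reproducing this kind
of failure locally (the table, columns, and index here are placeholders,
not from the thread). With the GUC set to regress, the planner puts a
single-copy Gather on top of the plan, so the leader itself never executes
the sort and its instrumentation stays empty.

-- hypothetical repro; any plan producing an incremental sort will do
SET force_parallel_mode = regress;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT * FROM t ORDER BY a, b;   -- assumes an index on t(a)
RESET force_parallel_mode;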

#291Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#290)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 04:14:38PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

Hmmm, I see the buildfarm is not happy about it - a couple of animals
failed, but some succeeded. The failure seems like a simple difference
in explain output, but it's not clear why it would happen (and I've
run the tests many times but never seen this failure).

Did you ever use force_parallel_mode = regress?

Ah, not sure - probably not in this round of tests and there were some
changes in the explain code. Thanks for the hint.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#292Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#291)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Mon, Apr 06, 2020 at 04:14:38PM -0400, Tom Lane wrote:

Did you ever use force_parallel_mode = regress?

Ah, not sure - probably not in this round of tests and there were some
changes in the explain code. Thanks for the hint.

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members). At a guess, looks like missing or incorrect logic
for propagating some state back from parallel workers.

regards, tom lane

#293Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#292)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures that were there about fifteen minutes
ago, and they were all identical. I can confirm that, on my laptop, the
tests work without that GUC and fail in exactly that way with it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#294Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#293)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures that were there about fifteen minutes
ago, and they were all identical. I can confirm that, on my laptop, the
tests work without that GUC and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

thanks

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#295James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#294)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

I'm stepping through this in a debugger; is the issue you're seeing
that the for loop through the workers is off by one?

James

#296James Coleman
jtc331@gmail.com
In reply to: James Coleman (#295)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 5:20 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

I'm stepping through this in a debugger; is the issue you're seeing
that the for loop through the workers is off by one?

Oh, never mind, misread that.

Looks like if the leader doesn't participate, then we don't show
details for workers.

Tomas: Do you already have a patch? If not, I can work one up.

James

#297James Coleman
jtc331@gmail.com
In reply to: James Coleman (#296)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 5:22 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Apr 6, 2020 at 5:20 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

I'm stepping through this in a debugger; is the issue you're seeing
that the for loop through the workers is off by one?

Oh, never mind, misread that.

Looks like if the leader doesn't participate, then we don't show
details for workers.

Tomas: Do you already have a patch? If not, I can work one up.

Well, already have it, so I'll send it just in case.

James

Attachments:

fix_explain_parallel.patchtext/x-patch; charset=US-ASCII; name=fix_explain_parallel.patchDownload
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62c86ecdc5..c31f3e0987 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2880,19 +2880,22 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 
 	fullsortGroupInfo = &incrsortstate->incsort_info.fullsortGroupInfo;
 
-	if (!(es->analyze && fullsortGroupInfo->groupCount > 0))
+	if (!es->analyze)
 		return;
 
-	show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
-	prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
-	if (prefixsortGroupInfo->groupCount > 0)
+	if (fullsortGroupInfo->groupCount > 0)
 	{
+		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
+		prefixsortGroupInfo = &incrsortstate->incsort_info.prefixsortGroupInfo;
+		if (prefixsortGroupInfo->groupCount > 0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, " ");
+			show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+		}
 		if (es->format == EXPLAIN_FORMAT_TEXT)
-			appendStringInfo(es->str, " ");
-		show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			appendStringInfo(es->str, "\n");
 	}
-	if (es->format == EXPLAIN_FORMAT_TEXT)
-		appendStringInfo(es->str, "\n");
 
 	if (incrsortstate->shared_info != NULL)
 	{
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index 3b359efa29..c46443358a 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -1,3 +1,5 @@
+set force_parallel_mode = regress;
+
 -- When we have to sort the entire table, incremental sort will
 -- be slower than plain sort, so it should not be used.
 explain (costs off)
#298Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#294)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.
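
For illustration, here is a minimal standalone sketch of that guard
structure (GroupInfo and show() are hypothetical stand-ins, not the
actual explain.c code), showing how a prefix-only state would fall
through silently:

#include <stdio.h>

/* Hypothetical stand-in for the real IncrementalSortGroupInfo. */
typedef struct GroupInfo { long groupCount; } GroupInfo;

/* Mirrors the guard structure: prefix-group info is only reached
 * inside the full-sort block. */
static void show(GroupInfo *fullsort, GroupInfo *prefixsort)
{
    if (fullsort->groupCount > 0)
    {
        printf("Full-sort Groups: %ld", fullsort->groupCount);
        if (prefixsort->groupCount > 0)
            printf("  Presorted Groups: %ld", prefixsort->groupCount);
        printf("\n");
    }
    /* fullsort == 0 with prefixsort > 0 prints nothing. */
}

int main(void)
{
    GroupInfo fullsort = {0}, prefixsort = {2};

    show(&fullsort, &prefixsort);   /* silent: the case described above */
    fullsort.groupCount = 1;
    show(&fullsort, &prefixsort);   /* Full-sort Groups: 1  Presorted Groups: 2 */
    return 0;
}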

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#299James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#298)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.
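
To make that concrete, here is a toy model of the invariant (assumed
behavior, not the executor code): the node starts in full-sort mode and
can only switch to prefix mode after sorting at least one full-sort
group, so a nonzero prefix group count implies a nonzero full-sort
group count:

#include <assert.h>
#include <stdio.h>

enum Mode { MODE_FULLSORT, MODE_PREFIXSORT };

int main(void)
{
    enum Mode mode = MODE_FULLSORT;
    long fullGroups = 0, prefixGroups = 0;

    for (int batch = 0; batch < 4; batch++)
    {
        if (mode == MODE_FULLSORT)
        {
            fullGroups++;           /* batch sorted as one full-sort group */
            mode = MODE_PREFIXSORT; /* transition only after that sort */
        }
        else
            prefixGroups++;

        /* prefix groups imply at least one full-sort group */
        assert(prefixGroups == 0 || fullGroups > 0);
    }
    printf("full=%ld prefix=%ld\n", fullGroups, prefixGroups);
    return 0;
}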

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

James

#300Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#298)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

OK, I've pushed a fix - this should make the buildfarm happy again.

Well, it's *less* unhappy. thorntail is showing that the number of
workers field is not stable; that will need to be masked.

regards, tom lane

#301Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#299)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#302Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#300)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

OK, I've pushed a fix - this should make the buildfarm happy again.

Well, it's *less* unhappy. thorntail is showing that the number of
workers field is not stable; that will need to be masked.

Yeah, I've already pushed a fix for that. But there seems to be another
failure in the explain output. Looking.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#303Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#302)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:

Well, it's *less* unhappy. thorntail is showing that the number of
workers field is not stable; that will need to be masked.

Yeah, I've already pushed a fix for that. But there seems to be another
failure in the explain output. Looking.

I'm kind of unimpressed with that fix, because it guarantees that if
there's any problem with more than 2 workers, this test will never
find it. I think you should do what I said and arrange to replace
the number-of-workers output with "N".

regards, tom lane

#304Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#303)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 06:34:04PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:

Well, it's *less* unhappy. thorntail is showing that the number of
workers field is not stable; that will need to be masked.

Yeah, I've already pushed a fix for that. But there seems to be another
failure in the explain output. Looking.

I'm kind of unimpressed with that fix, because it guarantees that if
there's any problem with more than 2 workers, this test will never
find it. I think you should do what I said and arrange to replace
the number-of-workers output with "N".

I can do that, but isn't this how every other regression test does it?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#305James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#301)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

James

#306James Coleman
jtc331@gmail.com
In reply to: James Coleman (#305)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 7:09 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

On rhinoceros I see:

================== pgsql.build/src/test/regress/regression.diffs
===================
diff -U3 /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
/opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
2020-03-14 10:37:49.156761104 -0700
+++ /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
2020-04-06 16:01:13.766798059 -0700
@@ -1328,8 +1328,9 @@
          ->  Sort (actual rows=3 loops=1)
                Sort Key: sq_limit.c1, sq_limit.pk
                Sort Method: top-N heapsort  Memory: xxx
+               Sort Method: unknown  Disk: 0kB
                ->  Seq Scan on sq_limit (actual rows=8 loops=1)
-(6 rows)
+(7 rows)

Same on sidewinder.

Given the 0kB I'm not sure this is *just* a work_mem thing, though
that's still something I'm curious to know about, and it's still part
of the "problem" here.

I'm investigating further.

James

#307Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#305)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 07:09:11PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

I don't think so - I don't see any work_mem changes in the config - see
the extra_config at the beginning of the page with details:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-04-06%2023%3A00%3A16

Moreover, this seems to be in regular Sort, not Incremental Sort, and it
very much looks like it gets confused into printing worker info, because
the only way for Sort to print two "Sort Method" lines seems to be to
enter both

if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
{
... print leader info ...
}

and

if (sortstate->shared_info != NULL)
{
for (n = 0; n < sortstate->shared_info->num_workers; n++)
{
... print worker info ...
}
}

or maybe there are two workers? It's strange ...

It doesn't seem to be particularly platform-specific, but I've been
unable to reproduce it so far. It seems to happen on older gcc versions, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#308Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#307)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

It doesn't seem to be particularly platform-specific, but I've been
unable to reproduce it so far. It seems to happen on older gcc versions, though.

It's looking kind of like an uninitialized-memory problem. Note
the latest from spurfowl,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2020-04-07%2000%3A15%3A05

which got through "make check" and then failed during pg_upgrade's
repetition of the test. Similarly on rhinoceros. So there's definitely
instability there even on one machine.

Perhaps something to do with unexpected cache flushes??

regards, tom lane

#309James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#307)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 7:31 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 07:09:11PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

I don't think so - I don't see any work_mem changes in the config - see
the extra_config at the beginning of the page with details:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-04-06%2023%3A00%3A16

Moreover, this seems to be in regular Sort, not Incremental Sort, and it
very much looks like it gets confused into printing worker info, because
the only way for Sort to print two "Sort Method" lines seems to be to
enter both

if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
{
... print leader info ...
}

and

if (sortstate->shared_info != NULL)
{
for (n = 0; n < sortstate->shared_info->num_workers; n++)
{
... print worker info ...
}
}

or maybe there are two workers? It's strange ...

It doesn't seem to be particularly platform-specific, but I've been
unable to reproduce it so far. It seems to happen on older gcc versions, though.

I haven't been able to reproduce it, but I'm 99% confident this will fix it:

-            if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+            if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+                || sinstrument->sortMethod == NULL)
                 continue;       /* ignore any unfilled slots */

Earlier we'd had this discussion about why SORT_TYPE_STILL_IN_PROGRESS
was explicitly set to 0 in the enum declaration. Since there was no
comment, we changed that, but here I believe that show_sort_info was
relying on that as an indicator that a worker didn't actually do any
work (since the DSM for the sort node gets set to all zeros, this
would work).

I'm not sure if the SORT_TYPE_STILL_IN_PROGRESS case is actually still
needed, though.
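
For illustration, a self-contained sketch of that failure mechanism (the
enum values and struct below are hypothetical stand-ins for the real
TuplesortMethod and instrumentation struct, not the actual declarations):

#include <stdio.h>
#include <string.h>

/* Stand-in values: after the bitmask change, the sentinel was nonzero. */
typedef enum
{
    SORT_TYPE_STILL_IN_PROGRESS = 1 << 0,
    SORT_TYPE_QUICKSORT = 1 << 2
} TuplesortMethod;

typedef struct
{
    TuplesortMethod sortMethod;
    long diskSpaceUsed;
} Instrumentation;

int main(void)
{
    Instrumentation slot;

    /* Worker slots live in DSM, which starts out all-zeroes. */
    memset(&slot, 0, sizeof(slot));

    /* With the sentinel nonzero, a zeroed (never-used) slot no longer
     * matches the skip test, so its zero method gets rendered as
     * "unknown" with 0kB, which is the buildfarm symptom. */
    if (slot.sortMethod != SORT_TYPE_STILL_IN_PROGRESS)
        printf("Sort Method: unknown  Disk: %ldkB\n", slot.diskSpaceUsed);

    return 0;
}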

I've attached both a fix for this issue and a comment for the
full/prefix sort group if blocks.

James

Attachments:

v1-0002-Comment-show_incremental_sort_info-assumptions.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Comment-show_incremental_sort_info-assumptions.patchDownload
From 963cb6ed7d27cd7112c4fc4b4fb138b63edf087e Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 6 Apr 2020 20:45:57 -0400
Subject: [PATCH v1 2/2] Comment show_incremental_sort_info assumptions

It's not immediately obvious when reading this code that ignoring the
prefix group info is correct if there are no full groups, so add a
comment explaining the rationale.
---
 src/backend/commands/explain.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b1a20eba27..b8fd542c9b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2884,6 +2884,15 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	if (!es->analyze)
 		return;
 
+	/*
+	 * Since we never have any prefix groups unless we've first sorted a full
+	 * group and transitioned modes (copying the tuples into a prefix group),
+	 * we don't need to do anything if there were 0 full groups.
+	 *
+	 * We still have to continue after this block if there are no full groups,
+	 * though, since it's possible that we have workers that did real work even
+	 * if the leader didn't participate.
+	 */
 	if (fullsortGroupInfo->groupCount > 0)
 	{
 		show_incremental_sort_group_info(fullsortGroupInfo, "Full-sort", true, es);
@@ -2915,6 +2924,13 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			 */
 			fullsortGroupInfo = &incsort_info->fullsortGroupInfo;
 			prefixsortGroupInfo = &incsort_info->prefixsortGroupInfo;
+
+			/*
+			 * Since we never have any prefix groups unless we've first sorted
+			 * a full group and transitioned modes (copying the tuples into a
+			 * prefix group), we don't need to do anything if there were 0 full
+			 * groups.
+			 */
 			if (fullsortGroupInfo->groupCount == 0 &&
 				prefixsortGroupInfo->groupCount == 0)
 				continue;
-- 
2.17.1

v1-0001-Don-t-show-worker-info-for-sort-node-if-no-work-d.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Don-t-show-worker-info-for-sort-node-if-no-work-d.patchDownload
From b1c0f842f46943de0035e4658e641a20522951fa Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Mon, 6 Apr 2020 20:38:18 -0400
Subject: [PATCH v1 1/2] Don't show worker info for sort node if no work done

In d2d8a229bc58a2014dce1c7a4fcdb6c5ab9fb8da we modified the
TuplesortMethod enum to be a bitmask. As such,
SORT_TYPE_STILL_IN_PROGRESS is now set to one. It'd previously been
explicitly set to 0, but with no comment, so that seemed reasonable.
However it seems that explain.c's show_sort_info method was using that 0
as a sentinel that the worker hadn't done any work, and thus it
shouldn't be printed (the worker's DSM is initialized to all zeros, so
this worked properly). Instead it should explicitly check for NULL.
---
 src/backend/commands/explain.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cad10662bb..b1a20eba27 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2717,7 +2717,8 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 			long		spaceUsed;
 
 			sinstrument = &sortstate->shared_info->sinstrument[n];
-			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+				|| sinstrument->sortMethod == NULL)
 				continue;		/* ignore any unfilled slots */
 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
-- 
2.17.1

#310Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#306)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 07:27:19PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 7:09 PM James Coleman <jtc331@gmail.com> wrote:

On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:

On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:

On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:

On 2020-Apr-06, Tom Lane wrote:

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical. I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

fullsortGroupInfo->groupCount == 0
prefixsortGroupInfo->groupCount > 0

but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

OK. Maybe we should add a comment to explain.c saying it's OK.

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

Sort Method: quicksort Memory: NNkB
+ Sort Method: unknown Disk: NNkB

Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

On rhinoceros I see:

================== pgsql.build/src/test/regress/regression.diffs
===================
diff -U3 /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
/opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
2020-03-14 10:37:49.156761104 -0700
+++ /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
2020-04-06 16:01:13.766798059 -0700
@@ -1328,8 +1328,9 @@
->  Sort (actual rows=3 loops=1)
Sort Key: sq_limit.c1, sq_limit.pk
Sort Method: top-N heapsort  Memory: xxx
+               Sort Method: unknown  Disk: 0kB
->  Seq Scan on sq_limit (actual rows=8 loops=1)
-(6 rows)
+(7 rows)

Same on sidewinder.

Given the 0kB I'm not sure this is *just* a work_mem thing, though
that's still something I'm curious to know about, and it's still part
of the "problem" here.

I'm investigating further.

I don't see how this could be caused by a low work_mem value, really. What
I think is happening is that when executed with

force_parallel_mode = regress

we run this as if in a parallel worker, but then it somehow gets confused
and prints either (leader + worker) stats or two workers' stats, or
something like that. I don't know why it's happening, though, or why it
would be triggered by the incremental sort patch ...

Actually, I just managed to trigger exactly this - the trick is that we
plan for certain number of workers, but then fail to start some. So for
example like this:

create table t (a int, b int, c int);

insert into t select mod(i,10), mod(i,10), i
from generate_series(1,1000000) s(i);

set force_parallel_mode = regress;

set max_parallel_workers = 0;

explain (costs off, analyze) select * from t order by a,b;

QUERY PLAN
---------------------------------------------------------------
Sort (actual time=0.010..0.010 rows=0 loops=1)
Sort Key: a, b
Sort Method: quicksort Memory: 25kB
Sort Method: unknown Disk: 0kB
-> Seq Scan on t (actual time=0.003..0.004 rows=0 loops=1)
Planning Time: 0.040 ms
Execution Time: 0.229 ms
(7 rows)

So we actually do have to print two lines, because without any workers
the leader ends up doing the work. But we can't tell this is happening,
because the number of workers started is shown in the Gather node, and
force_parallel_mode=regress hides that.

The question is why these failures are correlated with incremental sort.
I don't think we've tweaked Sort at all, no?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#311Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#308)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 08:42:13PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

It doesn't seem to be particularly platform-specific, but I've been
unable to reproduce it so far. It seems to happen on older gcc versions, though.

It's looking kind of like an uninitialized-memory problem. Note
the latest from spurfowl,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2020-04-07%2000%3A15%3A05

which got through "make check" and then failed during pg_upgrade's
repetition of the test. Similarly on rhinoceros. So there's definitely
instability there even on one machine.

Perhaps something to do with unexpected cache flushes??

I don't know, I've tried running the tests on a number of machines,
similar to those failing. Raspberry Pi, Fedora 31, ... and it worked
everywhere while the failures seem consistent.

I've been able to reproduce these failures (same symptoms) by making
sure the worker (implied by force_parallel_mode=regress) won't start.

set max_parallel_workers = 0;
set force_parallel_mode = regress;

triggers exactly those failures for me (at least during make check, I
haven't tried pg_upgrade tests etc.).

So my theory is that we fail to start parallel workers on those
machines. It's not clear to me why it would be limited to some machines,
or why it would be correlated with incremental sort. I don't think
those machines have a lower number of parallel workers, no?

But maybe incremental sort allowed using parallel plans for more
queries, and we simply run out of parallel workers that way?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#312Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#311)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I don't know, I've tried running the tests on a number of machines,
similar to those failing. Raspberry Pi, Fedora 31, ... and it worked
everywhere while the failures seem consistent.

On my machine, it reproduces about one time in six with
force_parallel_mode = regress. It seems possible given your
results that reducing max_parallel_workers would make it more
likely, but I've not tried that.

What I'm seeing, after adding some debug printouts, is that sortMethod is
frequently zero when we reach the EXPLAIN output for a worker. In many of
the tests this happens even though there is no visible failure, because
we've got a filter function hiding the output :-(

So I concur with James' conclusion that the existing code is relying on
sortMethod initializing to zeroes, and that we did the wrong thing by
trying to give SORT_TYPE_STILL_IN_PROGRESS a nonzero representation.
I do not like his patch though, particularly not the type pun with NULL.
I think the correct fix is to change the enum declaration.
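
That is, something of this shape (illustrative, not necessarily the
exact tuplesort.h declaration): keep the sentinel at zero so that
zero-initialized DSM slots read back as still-in-progress and get
skipped:

typedef enum
{
    SORT_TYPE_STILL_IN_PROGRESS = 0,    /* must stay 0: matches zeroed DSM */
    SORT_TYPE_TOP_N_HEAPSORT = 1 << 0,
    SORT_TYPE_QUICKSORT = 1 << 1,
    SORT_TYPE_EXTERNAL_SORT = 1 << 2,
    SORT_TYPE_EXTERNAL_MERGE = 1 << 3
} TuplesortMethod;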

I know it's darn late where you are, do you want me to change it?

regards, tom lane

#313James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#312)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I don't know, I've tried running the tests on a number of machines,
similar to those failing. Raspberry Pi, Fedora 31, ... and it worked
everywhere while the failures seem consistent.

On my machine, it reproduces about one time in six with
force_parallel_mode = regress. It seems possible given your
results that reducing max_parallel_workers would make it more
likely, but I've not tried that.

What I'm seeing, after adding some debug printouts, is that sortMethod is
frequently zero when we reach the EXPLAIN output for a worker. In many of
the tests this happens even though there is no visible failure, because
we've got a filter function hiding the output :-(

So I concur with James' conclusion that the existing code is relying on
sortMethod initializing to zeroes, and that we did the wrong thing by
trying to give SORT_TYPE_STILL_IN_PROGRESS a nonzero representation.
I do not like his patch though, particularly not the type pun with NULL.

Sentinel and NULL? I hadn't caught that at all.

I think the correct fix is to change the enum declaration.

Hmm. I don't actually really like that, because it means the value
here isn't actually semantically correct. That is, the sort type is
not "in progress"; it's "we never started a sort at all". I don't
really love the conflating of those things that the old enum
declaration had (even if it'd had a helpful comment). It seems to me that
we should make "we don't have a type" and "we have a type" distinct.

We could add a new enum value SORT_TYPE_UNINITIALIZED or similar though.

James

#314Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#313)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the correct fix is to change the enum declaration.

Hmm. I don't actually really like that, because it means the value
here isn't actually semantically correct. That is, the sort type is
not "in progress"; it's "we never started a sort at all".

Well, yeah, but that pre-dated this patch, and right now is no
time to improve it; we can debate such fine points at more leisure
once the buildfarm isn't broken.

Obviously the comment needs fixed...

regards, tom lane

#315James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#314)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 6, 2020 at 10:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

James Coleman <jtc331@gmail.com> writes:

On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the correct fix is to change the enum declaration.

Hmm. I don't actually really like that, because it means the value
here isn't actually semantically correct. That is, the sort type is
not "in progress"; it's "we never started a sort at all".

Well, yeah, but that pre-dated this patch, and right now is no
time to improve it; we can debate such fine points at more leisure
once the buildfarm isn't broken.

Fair enough. Unsure if Tomas is still online to comment and/or push,
but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
initial fix.

Obviously the comment needs fixed...

The one in show_sort_info?

I can work on that (and the other proposed cleanup above) with Tomas
tomorrow or later.

James

#316Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#315)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

Fair enough. Unsure if Tomas is still online to comment and/or push,
but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
initial fix.

I'm guessing he went to bed, so I'll push a fix in a moment.
The patch has survived enough test cycles here now to make me
moderately confident that it fixes the issue.

regards, tom lane

#317Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#316)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 10:19:41PM -0400, Tom Lane wrote:

James Coleman <jtc331@gmail.com> writes:

Fair enough. Unsure if Tomas is still online to comment and/or push,
but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
initial fix.

I'm guessing he went to bed, so I'll push a fix in a moment.
The patch has survived enough test cycles here now to make me
moderately confident that it fixes the issue.

Nope, how could I sleep with half of the buildfarm still red?

I came to the same conclusion (that the change in TuplesortMethod
definition is the culprit) a while ago and was about to push a fix that
initialized it correctly in ExecSortInitializeDSM. But I agree reverting
it back to the old definition is probably better.

Thanks!

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#318Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#317)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I came to the same conclusion (that the change in TuplesortMethod
definition is the culprit) a while ago and was about to push a fix that
initialized it correctly in ExecSortInitializeDSM. But I agree reverting
it back to the old definition is probably better.

Yeah, for the moment. James would like to not have
SORT_TYPE_STILL_IN_PROGRESS be part of the enum at all, I think,
and I can see his point --- but then we need some out-of-band
representation of "worker not done", so I'm not sure there'll be
any net reduction of cruft. Anyway that can be dealt with after
we have a stable buildfarm.

Note also that there's a separate comment-only patch in
<CAAaqYe9qzKbxCvSp3dfLkuS1v8KKnB7kW3z-hZ2jnAQaveSm8w@mail.gmail.com>
that shouldn't be forgotten about.

regards, tom lane

#319Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#318)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 11:00:37PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I came to the same conclusion (that the change in TuplesortMethod
definiton is the culprit) a while ago and was about to push a fix that
initialized it correctly in ExecSortInitializeDSM. But I agree reverting
it back to the old definition is probably better.

Yeah, for the moment. James would like to not have
SORT_TYPE_STILL_IN_PROGRESS be part of the enum at all, I think,
and I can see his point --- but then we need some out-of-band
representation of "worker not done", so I'm not sure there'll be
any net reduction of cruft. Anyway that can be dealt with after
we have a stable buildfarm.

Agreed.

Note also that there's a separate comment-only patch in
<CAAaqYe9qzKbxCvSp3dfLkuS1v8KKnB7kW3z-hZ2jnAQaveSm8w@mail.gmail.com>
that shouldn't be forgotten about.

OK, I'll take care of that tomorrow. I have two smaller patches to
commit in the incremental sort patchset, so I'll add it to that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#320Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#288)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros?

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
Groups"? I think you should maybe do that instead of the "semicolon
separator". I think "two spaces" makes sense, since the units are different,
similar to hash buckets and normal sort node.

"Buckets: %d Batches: %d Memory Usage: %ldkB\n",
appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",

Note, I made a similar comment regarding two spaces for explain(WAL) here:
/messages/by-id/20200402054120.GC14618@telsasoft.com

And Peter E seemed to dislike that, here:
/messages/by-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3@2ndquadrant.com

Also, you're showing:
ExplainPropertyInteger("Maximum Sort Space Used", "kB",
groupInfo->maxMemorySpaceUsed, es);

But in show_hash_info() and show_hashagg_info(), and in your own text output,
that's called "Peak":
ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
ExplainPropertyInteger("Peak Memory Usage", "kB",
spacePeakKb, es);

--
Justin

Attachments:

v1-0001-comment-typos-and-others-Incremental-Sort.patchtext/x-diff; charset=us-asciiDownload
From e26f2cc842792fc3c0dd4b4b97c0996d450c6dd7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Mon, 6 Apr 2020 17:37:31 -0500
Subject: [PATCH v1] comment typos and others: Incremental Sort

commit d2d8a229bc58a2014dce1c7a4fcdb6c5ab9fb8da
Author: Tomas Vondra <tomas.vondra@postgresql.org>
---
 doc/src/sgml/perform.sgml                  |  2 +-
 src/backend/commands/explain.c             | 17 +++++++-------
 src/backend/executor/nodeIncrementalSort.c | 26 +++++++++++-----------
 src/backend/utils/sort/tuplesort.c         | 10 ++++-----
 src/include/utils/tuplesort.h              |  2 +-
 5 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0dfc3e80e2..f448abd073 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -311,7 +311,7 @@ EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
    ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
 </screen>
 
-    If the a part of the plan guarantess an ordering on a prefix of the
+    If a part of the plan guarantees an ordering on a prefix of the
     required sort keys, then the planner may instead decide to use an
     <literal>incremental sort</literal> step:
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index baaa5817af..ca4fe3307d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2532,7 +2532,7 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 
 	ExplainPropertyList(qlabel, result, es);
 	if (nPresortedKeys > 0)
-		ExplainPropertyList("Presorted Key", resultPresorted, es);
+		ExplainPropertyList("Pre-sorted Key", resultPresorted, es);
 }
 
 /*
@@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 
 		ExplainPropertyList("Sort Methods Used", methodNames, es);
 
-		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
@@ -2841,12 +2840,12 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			ExplainOpenGroup("Sort Space", memoryName.data, true, es);
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
-			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+			ExplainPropertyInteger("Peak Sort Space Used", "kB",
 								   groupInfo->maxMemorySpaceUsed, es);
 
 			ExplainCloseGroup("Sort Spaces", memoryName.data, true, es);
 		}
-		if (groupInfo->maxDiskSpaceUsed > 0)
+
 		{
 			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
@@ -2858,7 +2857,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			ExplainOpenGroup("Sort Space", diskName.data, true, es);
 
 			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
-			ExplainPropertyInteger("Maximum Sort Space Used", "kB",
+			ExplainPropertyInteger("Peak Sort Space Used", "kB",
 								   groupInfo->maxDiskSpaceUsed, es);
 
 			ExplainCloseGroup("Sort Spaces", diskName.data, true, es);
@@ -2869,7 +2868,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 }
 
 /*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
  */
 static void
 show_incremental_sort_info(IncrementalSortState *incrsortstate,
@@ -2891,7 +2890,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
 				appendStringInfo(es->str, " ");
-			show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+			show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", false, es);
 		}
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 			appendStringInfo(es->str, "\n");
@@ -2908,7 +2907,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			&incrsortstate->shared_info->sinfo[n];
 
 			/*
-			 * If a worker hasn't process any sort groups at all, then exclude
+			 * If a worker hasn't processed any sort groups at all, then exclude
 			 * it from output since it either didn't launch or didn't
 			 * contribute anything meaningful.
 			 */
@@ -2928,7 +2927,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			{
 				if (es->format == EXPLAIN_FORMAT_TEXT)
 					appendStringInfo(es->str, " ");
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", false, es);
 			}
 			if (es->format == EXPLAIN_FORMAT_TEXT)
 				appendStringInfo(es->str, "\n");
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index bcab7c054c..afdec2a0cd 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -270,7 +270,7 @@ isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot
  * verify they're all part of the same prefix key group before sorting them
  * solely by unsorted suffix keys.
  *
- * While it's likely that all already fetch tuples are all part of a single
+ * While it's likely that all already fetched tuples are all part of a single
  * prefix group, we also have to handle the possibility that there is at least
  * one different prefix key group before the large prefix key group.
  * ----------------------------------------------------------------
@@ -381,7 +381,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 				 * node->transfer_tuple slot, and, even though that slot
 				 * points to memory inside the full sort tuplesort, we can't
 				 * reset that tuplesort anyway until we've fully transferred
-				 * out of its tuples, so this reference is safe. We do need to
+				 * out its tuples, so this reference is safe. We do need to
 				 * reset the group pivot tuple though since we've finished the
 				 * current prefix key group.
 				 */
@@ -603,7 +603,7 @@ ExecIncrementalSort(PlanState *pstate)
 			/*
 			 * Initialize presorted column support structures for
 			 * isCurrentGroup(). It's correct to do this along with the
-			 * initial intialization for the full sort state (and not for the
+			 * initial initialization for the full sort state (and not for the
 			 * prefix sort state) since we always load the full sort state
 			 * first.
 			 */
@@ -723,7 +723,7 @@ ExecIncrementalSort(PlanState *pstate)
 				nTuples++;
 
 				/*
-				 * If we've reach our minimum group size, then we need to
+				 * If we've reached our minimum group size, then we need to
 				 * store the most recent tuple as a pivot.
 				 */
 				if (nTuples == minGroupSize)
@@ -752,7 +752,7 @@ ExecIncrementalSort(PlanState *pstate)
 				{
 					/*
 					 * Since the tuple we fetched isn't part of the current
-					 * prefix key group we don't want to  sort it as part of
+					 * prefix key group we don't want to sort it as part of
 					 * the current batch. Instead we use the group_pivot slot
 					 * to carry it over to the next batch (even though we
 					 * won't actually treat it as a group pivot).
@@ -792,12 +792,12 @@ ExecIncrementalSort(PlanState *pstate)
 			}
 
 			/*
-			 * Unless we've alrady transitioned modes to reading from the full
+			 * Unless we've already transitioned modes to reading from the full
 			 * sort state, then we assume that having read at least
 			 * DEFAULT_MAX_FULL_SORT_GROUP_SIZE tuples means it's likely we're
 			 * processing a large group of tuples all having equal prefix keys
 			 * (but haven't yet found the final tuple in that prefix key
-			 * group), so we need to transition in to presorted prefix mode.
+			 * group), so we need to transition into presorted prefix mode.
 			 */
 			if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
 				node->execution_status != INCSORT_READFULLSORT)
@@ -849,7 +849,7 @@ ExecIncrementalSort(PlanState *pstate)
 
 				/*
 				 * We might have multiple prefix key groups in the full sort
-				 * state, so the mode transition function needs to know the it
+				 * state, so the mode transition function needs to know that it
 				 * needs to move from the fullsort to presorted prefix sort.
 				 */
 				node->n_fullsort_remaining = nTuples;
@@ -913,7 +913,7 @@ ExecIncrementalSort(PlanState *pstate)
 			/*
 			 * If the tuple's prefix keys match our pivot tuple, we're not
 			 * done yet and can load it into the prefix sort state. If not, we
-			 * don't want to  sort it as part of the current batch. Instead we
+			 * don't want to sort it as part of the current batch. Instead we
 			 * use the group_pivot slot to carry it over to the next batch
 			 * (even though we won't actually treat it as a group pivot).
 			 */
@@ -987,7 +987,7 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 
 	/*
 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
-	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only ???? one of many sort
 	 * batches in the current sort state.
 	 */
 	Assert((eflags & (EXEC_FLAG_BACKWARD |
@@ -1121,14 +1121,14 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 
 	/*
-	 * Incremental sort doesn't support efficient rescan even when paramters
+	 * Incremental sort doesn't support efficient rescan even when parameters
 	 * haven't changed (e.g., rewind) because unlike regular sort we don't
 	 * store all tuples at once for the full sort.
 	 *
 	 * So even if EXEC_FLAG_REWIND is set we just reset all of our state and
 	 * reexecute the sort along with the child node below us.
 	 *
-	 * In theory if we've only fill the full sort with one batch (and haven't
+	 * In theory if we've only filled the full sort with one batch (and haven't
 	 * reset it for a new batch yet) then we could efficiently rewind, but
 	 * that seems a narrow enough case that it's not worth handling specially
 	 * at this time.
@@ -1153,7 +1153,7 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	/*
 	 * If we've set up either of the sort states yet, we need to reset them.
 	 * We could end them and null out the pointers, but there's no reason to
-	 * repay the setup cost, and because guard setting up pivot comparator
+	 * repay the setup cost, and because ???? guard setting up pivot comparator
 	 * state similarly, doing so might actually cause a leak.
 	 */
 	if (node->fullsort_state != NULL)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cc33a85731..a965fb0025 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -808,7 +808,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
  *
  * Setup, or reset, all state need for processing a new set of tuples with this
  * sort state. Called both from tuplesort_begin_common (the first time sorting
- * with this sort state) and tuplesort_reseti (for subsequent usages).
+ * with this sort state) and tuplesort_reset (for subsequent usages).
  */
 static void
 tuplesort_begin_batch(Tuplesortstate *state)
@@ -1428,11 +1428,11 @@ tuplesort_updatemax(Tuplesortstate *state)
 	}
 
 	/*
-	 * Sort evicts data to the disk when it didn't manage to fit those data to
-	 * the main memory.  This is why we assume space used on the disk to be
+	 * Sort evicts data to the disk when it didn't manage to fit the data in
+	 * main memory.  This is why we assume space used on the disk to be
 	 * more important for tracking resource usage than space used in memory.
-	 * Note that amount of space occupied by some tuple set on the disk might
-	 * be less than amount of space occupied by the same tuple set in the
+	 * Note that amount of space occupied by some tupleset on the disk might
+	 * be less than amount of space occupied by the same tupleset in the
 	 * memory due to more compact representation.
 	 */
 	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 04d263228d..d992b4875a 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -63,7 +63,7 @@ typedef struct SortCoordinateData *SortCoordinate;
  * sometimes put it in shared memory.
  *
  * The parallel-sort infrastructure relies on having a zero TuplesortMethod
- * indicate that a worker never did anything, so we assign zero to
+ * to indicate that a worker never did anything, so we assign zero to
  * SORT_TYPE_STILL_IN_PROGRESS.  The other values of this enum can be
  * OR'ed together to represent a situation where different workers used
  * different methods, so we need a separate bit for each one.  Keep the
-- 
2.17.0

#321James Coleman
jtc331@gmail.com
In reply to: Justin Pryzby (#320)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
Groups"? I think you should maybe do that instead of the "semicolon
separator". I think "two spaces" makes sense, since the units are different,
similar to hash buckets and normal sort node.

"Buckets: %d Batches: %d Memory Usage: %ldkB\n",
appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",

Note, I made a similar comment regarding two spaces for explain(WAL) here:
/messages/by-id/20200402054120.GC14618@telsasoft.com

And Peter E seemed to dislike that, here:
/messages/by-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3@2ndquadrant.com

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

Also, you're showing:
ExplainPropertyInteger("Maximum Sort Space Used", "kB",
groupInfo->maxMemorySpaceUsed, es);

But in show_hash_info() and show_hashagg_info(), and in your own text output,
that's called "Peak":
ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
ExplainPropertyInteger("Peak Memory Usage", "kB",
spacePeakKb, es);

Yes, that's a miss and should be fixed.

James

#322Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#321)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
Groups"? I think you should maybe do that instead of the "semicolon
separator". I think "two spaces" makes sense, since the units are different,
similar to hash buckets and normal sort node.

"Buckets: %d Batches: %d Memory Usage: %ldkB\n",
appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",

Note, I made a similar comment regarding two spaces for explain(WAL) here:
/messages/by-id/20200402054120.GC14618@telsasoft.com

And Peter E seemed to dislike that, here:
/messages/by-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3@2ndquadrant.com

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

Yeah, I don't think there was a clear consensus :-(

Also, you're showing:
ExplainPropertyInteger("Maximum Sort Space Used", "kB",
groupInfo->maxMemorySpaceUsed, es);

But in show_hash_info() and show_hashagg_info(), and in your own text output,
that's called "Peak":
ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
ExplainPropertyInteger("Peak Memory Usage", "kB",
spacePeakKb, es);

Yes, that's a miss and should be fixed.

Will fix.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#323Justin Pryzby
pryzby@telsasoft.com
In reply to: James Coleman (#321)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

Right, I'm not sure either, since it seems to be a new case. Maybe Tomas has a
strong intuition.

See at least the commit messages here:
3ec20c7091e97a554e7447ac2b7f4ed795631395
7d91b604d9b5d6ec8c19c57a9ffd2f27129cdd94
8ebb69f85445177575684a0ba5cfedda8d840a91

Maybe this one suggests that it should *not* be present unconditionally, but
only when that sort type is used?
4b234fd8bf21cd6f5ff44f1f1c613bf40860998d

Another thought: is checking if bytes>0 really a good way to determine if a
sort type was used? It seems like checking a bit or a pointer would be
better. I guess a size of 0 is unlikely, and it's OK at least in text mode.

if (groupInfo->maxMemorySpaceUsed > 0)
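
A sketch of the flag-based alternative (hypothetical field, not in the
committed struct; the existing sortMethods bitmask could serve a similar
purpose):

    typedef struct IncrementalSortGroupInfo
    {
        int64   groupCount;
        long    maxMemorySpaceUsed;
        long    maxDiskSpaceUsed;
        bool    usedDisk;       /* hypothetical: set when the sort spills */
        /* ... remaining fields unchanged ... */
    } IncrementalSortGroupInfo;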

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted

...

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing. I think let's wait until that's settled before
addressing this more complex and even newer case. We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

--
Justin

#324Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#320)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Thanks. The typo fixes seem clear, except for this bit:

* If we've set up either of the sort states yet, we need to reset them.
* We could end them and null out the pointers, but there's no reason to
* repay the setup cost, and because ???? guard setting up pivot comparator
* state similarly, doing so might actually cause a leak.

I can't figure out what ???? should be. James, do you recall what this
should be?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#325James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#324)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Thanks. The typo fixes seem clear, except for this bit:

* If we've set up either of the sort states yet, we need to reset them.
* We could end them and null out the pointers, but there's no reason to
* repay the setup cost, and because ???? guard setting up pivot comparator
* state similarly, doing so might actually cause a leak.

I can't figure out what ???? should be. James, do you recall what this
should be?

Yep, it's ExecIncrementalSort. If you look for the block guarded by
`if (fullsort_state == NULL)` you'll see the call to
preparePresortedCols(), which sets up the pivot comparator state
referenced by this comment.
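
For reference, a rough sketch of the structure being described,
paraphrased from nodeIncrementalSort.c (not a verbatim excerpt):

    if (fullsort_state == NULL)
    {
        /*
         * First time through: set up the pivot comparator state used by
         * isCurrentGroup().  Because this setup is guarded by the
         * fullsort_state == NULL check, nulling out the sort states on
         * rescan would re-run it and could leak.
         */
        preparePresortedCols(node);

        /* ... then create the full sort tuplesort state ... */
    }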

James

#326Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#325)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 07:50:26PM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Thanks. The typo fixes seem clear, except for this bit:

* If we've set up either of the sort states yet, we need to reset them.
* We could end them and null out the pointers, but there's no reason to
* repay the setup cost, and because ???? guard setting up pivot comparator
* state similarly, doing so might actually cause a leak.

I can't figure out what ???? should be. James, do you recall what this
should be?

Yep, it's ExecIncrementalSort. If you look for the block guarded by
`if (fullsort_state == NULL)` you'll see the call to
preparePresortedCols(), which sets up the pivot comparator state
referenced by this comment.

OK, so it should be "... and because ExecIncrementalSort guard ..."?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#327James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#326)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 7, 2020 at 7:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Tue, Apr 07, 2020 at 07:50:26PM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Thanks. The typo fixes seem clear, except for this bit:

* If we've set up either of the sort states yet, we need to reset them.
* We could end them and null out the pointers, but there's no reason to
* repay the setup cost, and because ???? guard setting up pivot comparator
* state similarly, doing so might actually cause a leak.

I can't figure out what ???? should be. James, do you recall what this
should be?

Yep, it's ExecIncrementalSort. If you look for the block guarded by
`if (fullsort_state == NULL)` you'll see the call to
preparePresortedCols(), which sets up the pivot comparator state
referenced by this comment.

OK, so it should be "... and because ExecIncrementalSort guard ..."?

Yes, "because ExecIncrementalSort guards presorted column functions by
checking to see if the full sort state has been initialized yet,
setting the sort states to null here might cause..." (that's more
specific IMO than my original "pivot comparator state...doing so").

James

#328Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#288)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Hi,

I've pushed the second part of this patch series, adding incremental
sort to additional places in the planner. As explained in the commit
message (and somewhere in this thread) we've decided to only update some
of the places that require sorted input (and do create_sort). This might
be overly cautious, I expect we'll add it to more places in the future.

As for the remaining part tweaking add_partial_path to also consider
startup cost (a bit confusingly 0001 - 0003 in the v54 patchset), I've
decided not to push it now and leave it for v14. The add_partial_path
change is simple, but I came to the conclusion that the "precheck"
function should be modified to follow the same logic - it would be a bit
strange if add_partial_path_precheck considered only total cost and
add_partial_path considered both startup and total cost. It would not
matter for most places because the add_partial_path_precheck is only used
in join planning, but it's still strange.

I could have modified the add_partial_path_precheck too, but looking at
add_path_precheck we'd probably need to compute required_outer so that
we only compare startup_cost when really useful. Or we might simply
consider startup_cost every time and leave that up to add_partial_path,
but then that would be another difference in behavior.
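
To make the asymmetry concrete, here is a simplified sketch (not the
actual planner code) of the two comparisons in question:

    /* add_partial_path_precheck() today: total cost only */
    if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
        keyscmp != PATHKEYS_BETTER1)
        return false;           /* startup cost never considered */

    /* proposed add_partial_path(): reject only if dominated on both */
    if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
        new_path->startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR)
        accept_new = false;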

That seems like way too much stuff to rework on the last day of the last
commitfest. It does mean we'll fail to generate the cheapest plan in some
cases (e.g. with LIMIT, there's an example in [1]) but that's a
pre-existing condition, not something introduced by incremental sort.

regards

[1]: /messages/by-id/20190720132244.3vgg2uynfpxh3me5@development

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#329Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#328)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

regards, tom lane

#330Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#329)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#331Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#330)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#332James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#331)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

James

#333Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#332)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

Maybe, but I'd much rather tweak the data so that we test both the
costing and execution part.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#334Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#333)
1 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:

On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

Maybe, but I'd much rather tweak the data so that we test both the
costing and execution part.

I do think this does the trick by increasing the number of rows a bit
(from 100 to 1000) to make the Sort more expensive than Incremental
Sort, while still testing the transition points.

James, can you verify that's still true?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

incremental-sort-test-fix.patchtext/plain; charset=us-asciiDownload
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 3072d95643..fb4ab95922 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -141,7 +141,8 @@ begin
 end;
 $$;
 -- A single large group tested around each mode transition point.
-insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+insert into t(a, b) select i/100 + 1, i + 1 from generate_series(0, 999) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
            QUERY PLAN            
 ---------------------------------
@@ -456,7 +457,8 @@ select * from (select * from t order by a) s order by a, b limit 66;
 
 delete from t;
 -- An initial large group followed by a small group.
-insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+insert into t(a, b) select i/50 + 1, i + 1 from generate_series(0, 999) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
            QUERY PLAN            
 ---------------------------------
@@ -521,7 +523,7 @@ select * from (select * from t order by a) s order by a, b limit 55;
  1 | 47
  1 | 48
  1 | 49
- 2 | 50
+ 1 | 50
  2 | 51
  2 | 52
  2 | 53
@@ -538,10 +540,10 @@ select explain_analyze_without_memory('select * from (select * from t order by a
          Sort Key: t.a, t.b
          Presorted Key: t.a
          Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
-         ->  Sort (actual rows=100 loops=1)
+         ->  Sort (actual rows=101 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
-               ->  Seq Scan on t (actual rows=100 loops=1)
+               ->  Seq Scan on t (actual rows=1000 loops=1)
 (9 rows)
 
 select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 55'));
@@ -584,7 +586,8 @@ select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select *
 
 delete from t;
 -- An initial small group followed by a large group.
-insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
            QUERY PLAN            
 ---------------------------------
@@ -705,17 +708,17 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                                                           explain_analyze_without_memory                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------
+                                                                    explain_analyze_without_memory                                                                    
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Method: quicksort Memory: avg=NNkB peak=NNkB
-         ->  Sort (actual rows=100 loops=1)
+         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         ->  Sort (actual rows=1000 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
-               ->  Seq Scan on t (actual rows=100 loops=1)
+               ->  Seq Scan on t (actual rows=1000 loops=1)
 (9 rows)
 
 select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
@@ -747,6 +750,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
          "Presorted Groups": {                  +
              "Group Count": 5,                  +
              "Sort Methods Used": [             +
+                 "top-N heapsort",              +
                  "quicksort"                    +
              ],                                 +
              "Sort Space Memory": {             +
@@ -767,7 +771,8 @@ select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select *
 
 delete from t;
 -- Small groups of 10 tuples each tested around each mode transition point.
-insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+insert into t(a, b) select i / 10, i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
            QUERY PLAN            
 ---------------------------------
@@ -1082,7 +1087,8 @@ select * from (select * from t order by a) s order by a, b limit 66;
 
 delete from t;
 -- Small groups of only 1 tuple each tested around each mode transition point.
-insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+insert into t(a, b) select i, i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
            QUERY PLAN            
 ---------------------------------
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index e78a96d5bf..cf304a3441 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -119,7 +119,8 @@ end;
 $$;
 
 -- A single large group tested around each mode transition point.
-insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+insert into t(a, b) select i/100 + 1, i + 1 from generate_series(0, 999) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
 select * from (select * from t order by a) s order by a, b limit 31;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
@@ -133,7 +134,8 @@ select * from (select * from t order by a) s order by a, b limit 66;
 delete from t;
 
 -- An initial large group followed by a small group.
-insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+insert into t(a, b) select i/50 + 1, i + 1 from generate_series(0, 999) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
 select * from (select * from t order by a) s order by a, b limit 55;
 -- Test EXPLAIN ANALYZE with only a fullsort group.
@@ -143,7 +145,8 @@ select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select *
 delete from t;
 
 -- An initial small group followed by a large group.
-insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
 select * from (select * from t order by a) s order by a, b limit 70;
 -- Test rescan.
@@ -164,7 +167,8 @@ select explain_analyze_inc_sort_nodes_verify_invariants('select * from (select *
 delete from t;
 
 -- Small groups of 10 tuples each tested around each mode transition point.
-insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+insert into t(a, b) select i / 10, i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
 select * from (select * from t order by a) s order by a, b limit 31;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
@@ -178,7 +182,8 @@ select * from (select * from t order by a) s order by a, b limit 66;
 delete from t;
 
 -- Small groups of only 1 tuple each tested around each mode transition point.
-insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+insert into t(a, b) select i, i from generate_series(1, 1000) n(i);
+analyze;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
 select * from (select * from t order by a) s order by a, b limit 31;
 explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
#335James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#334)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 8, 2020 at 11:02 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:

On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

Maybe, but I'd much rather tweak the data so that we test both the
costing and execution part.

I do think this does the trick by increasing the number of rows a bit
(from 100 to 1000) to make the Sort more expensive than Incremental
Sort, while still testing the transition points.

James, can you verify that's still true?

Those changes all look good to me from a "testing correctness" POV.
Also I like that we now test multiple sort methods in the explain
output, like: "Sort Methods: top-N heapsort, quicksort".

I personally find the `i/100` notation harder to read than a case, but
that's just an opinion...

Should we change `analyze` to `analyze t` to avoid unnecessarily
re-analyzing all other tables in the regression db?

James

#336David Steele
david@pgmasters.net
In reply to: James Coleman (#335)
Re: [PATCH] Incremental sort

On 4/8/20 11:13 AM, James Coleman wrote:

James, can you verify that's still true?

I marked this entry as committed in the 2020-03 CF but it's not clear to
me if that's entirely true. I'll leave it up to you (all) to move it to
the 2020-07 CF if there is remaining work (other than making the build
farm happy).

Regards,
--
-David
david@pgmasters.net

#337Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#335)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

Should we change `analyze` to `analyze t` to avoid unnecessarily
re-analyzing all other tables in the regression db?

Yes, a global analyze here is a remarkably horrid idea.

regards, tom lane

#338James Coleman
jtc331@gmail.com
In reply to: David Steele (#336)
Re: [PATCH] Incremental sort

On Wed, Apr 8, 2020 at 11:29 AM David Steele <david@pgmasters.net> wrote:

On 4/8/20 11:13 AM, James Coleman wrote:

James, can you verify that's still true?

I marked this entry as committed in the 2020-03 CF but it's not clear to
me if that's entirely true. I'll leave it up to you (all) to move it to
the 2020-07 CF if there is remaining work (other than making the build
farm happy).

Thanks.

I think it's true enough. The vast majority is committed, and the
small amount that isn't we'll leave as future improvements and
separate threads.

James

#339Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#335)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Wed, Apr 08, 2020 at 11:13:26AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 11:02 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:

On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:

hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

Maybe, but I'd much rather tweak the data so that we test both the
costing and execution part.

I do think this does the trick by increasing the number of rows a bit
(from 100 to 1000) to make the Sort more expensive than Incremental
Sort, while still testing the transition points.

James, can you verify that's still true?

Those changes all look good to me from a "testing correctness" POV.
Also I like that we now test multiple sort methods in the explain
output, like: "Sort Methods: top-N heapsort, quicksort".

OK, good. I'll push the fix.

I personally find the `i/100` notation harder to read than a case, but
that's just an opinion...

Yeah, but with 1000 rows we'd need a more complex CASE statement (I
don't think simply having two groups - small+large would work).
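
For illustration, ten groups of 100 rows written both ways (the CASE form
here is a hypothetical sketch, needing one arm per group):

    -- integer-division form used in the fix
    insert into t(a, b) select i/100 + 1, i + 1 from generate_series(0, 999) n(i);

    -- equivalent CASE form
    insert into t(a, b)
    select case when i < 100 then 1
                when i < 200 then 2
                -- ... seven more arms ...
                else 10 end,
           i + 1
    from generate_series(0, 999) n(i);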

Should we change `analyze` to `analyze t` to avoid unnecessarily
re-analyzing all other tables in the regression db?

Ah, definitely. That was a mistake. Thanks for noticing.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#340Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: James Coleman (#338)
Re: [PATCH] Incremental sort

On Wed, Apr 08, 2020 at 11:42:12AM -0400, James Coleman wrote:

On Wed, Apr 8, 2020 at 11:29 AM David Steele <david@pgmasters.net> wrote:

On 4/8/20 11:13 AM, James Coleman wrote:

James, can you verify that's still true?

I marked this entry as committed in the 2020-03 CF but it's not clear to
me if that's entirely true. I'll leave it up to you (all) to move it to
the 2020-07 CF if there is remaining work (other than making the build
farm happy).

Thanks.

I think it's true enough. The vast majority is committed, and the
small amount that isn't we'll leave as future improvements and
separate threads.

Right. The remaining bit is a generic issue not entirely specific to
incremental sort, so we better not hide it in this CF entry.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#341James Coleman
jtc331@gmail.com
In reply to: Tomas Vondra (#339)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

James

#342James Coleman
jtc331@gmail.com
In reply to: James Coleman (#341)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:

One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

Background: this came up because I noticed that pg_proc_aclcheck is
called in the scalar array op case in execExpr.c.

However grepping through the source code I see several places where a
function (including an equality op for an ordering op, like the case
we have here) gets looked up without calling pg_proc_aclcheck, but
then other places where the acl check is invoked.

In addition, I haven't been able to discern a reason why sometimes
InvokeFunctionExecuteHook gets called with the function after lookup,
but not in others.

So I'm not sure if either of these needed to be added to the equality
op/function lookup code in nodeIncrementalSort's preparePresortedCols
or not.

James

#343Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#342)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman <jtc331@gmail.com> writes:

On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:

One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

Background: this came up because I noticed that pg_proc_aclcheck is
called in the scalar array op case in execExpr.c.

However grepping through the source code I see several places where a
function (including an equality op for an ordering op, like the case
we have here) gets looked up without calling pg_proc_aclcheck, but
then other places where the acl check is invoked.

Rule of thumb is that we don't apply ACL checks to functions/ops
we get out of an opclass; adding a function to an opclass is tantamount
to giving public execute permission on it. If the function/operator
reference came directly from the SQL query it must be checked.
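
For reference, the check applied in the query-referenced case looks like
this (paraphrased from the scalar-array-op path in execExpr.c); functions
taken from an opclass skip it:

    aclresult = pg_proc_aclcheck(funcid, GetUserId(), ACL_EXECUTE);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_FUNCTION, get_func_name(funcid));
    InvokeFunctionExecuteHook(funcid);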

In addition, I haven't been able to discern a reason why sometimes
InvokeFunctionExecuteHook gets called with the function after lookup,
but not in others.

I would not stand here and say that that hook infrastructure is worth
anything at all. Maybe the coverage is sufficient for some use-cases,
but who's to say?

regards, tom lane

#344James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#343)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Apr 16, 2020 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

James Coleman <jtc331@gmail.com> writes:

On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:

One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

Background: this came up because I noticed that pg_proc_aclcheck is
called in the scalar array op case in execExpr.c.

However grepping through the source code I see several places where a
function (including an equality op for an ordering op, like the case
we have here) gets looked up without calling pg_proc_aclcheck, but
then other places where the acl check is invoked.

Rule of thumb is that we don't apply ACL checks to functions/ops
we get out of an opclass; adding a function to an opclass is tantamount
to giving public execute permission on it. If the function/operator
reference came directly from the SQL query it must be checked.

All right, in that case I believe we're OK here without modification.
We're looking up the equality op based on the ordering op the planner
has already selected for sorting the query, and I'm assuming that
looking that up via the op family is in the same category as "getting
out of an opclass" (since opclasses are part of an opfamily).

Thanks for the explanation.

In addition, I haven't been able to discern a reason why sometimes
InvokeFunctionExecuteHook gets called with the function after lookup,
but not in others.

I would not stand here and say that that hook infrastructure is worth
anything at all. Maybe the coverage is sufficient for some use-cases,
but who's to say?

Interesting. It does look to be particularly underused. Just grepping
for that hook invocation macro shows, for example, that it's not used
in nodeSort.c or tuplesort.c, so clearly it's not executed for the
functions we'd use in regular sort. Given that...I think we can
proceed without it here too.

James

#345Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#323)
3 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted

...

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing. I think let's wait until that's settled before
addressing this more complex and even newer case. We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

The change proposed on the WAL thread is minimal, and makes new explain(WAL)
output consistent with that of explain(BUFFERS).

That uses a different format from "Sort", which is what incremental sort should
follow. (Hashjoin also uses Sort's format of two spaces and colons rather
than equals).

So the attached 0001 makes explain output for incremental sort more consistent
with sort:

- Two spaces;
- colons rather than equals;
- Don't use semicolon, which isn't in use anywhere else;

I tested with this:
template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a; CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;
template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
...
Full-sort Groups: 1000 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB Pre-sorted Groups: 1000 Sort Method: quicksort Average Memory: 30kB Peak Memory: 30kB

On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

I still think Pre-sorted groups should be on a separate line, as in 0002.
In addition to looking better (to me), and being easier to read, another reason
is that there are essentially key=>values here, but the keys are repeated (Sort
Method, etc).

I also suggested to rename: s/Presorted/Pre-sorted/, but I didn't do that here.

Michael already patched most of the comment typos; the remainder I'm including
here as a "nearby patch".

--
Justin

Attachments:

v1-0001-Fix-explain-output-for-incr-sort.patchtext/x-diff; charset=us-asciiDownload
From 55044341f82b847d136cd17df5a3c8d80c8371b4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 15 Apr 2020 08:45:21 -0500
Subject: [PATCH v1 1/3] Fix explain output for incr sort:

 - Two spaces;
 - colons rather than equals;
 - Don't use semicolon;
---
 src/backend/commands/explain.c                 | 18 +++++++-----------
 src/test/regress/expected/incremental_sort.out | 12 ++++++------
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7ae6131676..9257c52707 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2778,7 +2778,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	{
 		if (indent)
 			appendStringInfoSpaces(es->str, es->indent * 2);
-		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT " Sort Method", groupLabel,
+		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT "  Sort Method", groupLabel,
 						 groupInfo->groupCount);
 		/* plural/singular based on methodNames size */
 		if (list_length(methodNames) > 1)
@@ -2798,24 +2798,20 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+			appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
 							 spaceTypeName, avgSpace,
-							 groupInfo->maxMemorySpaceUsed);
+							 spaceTypeName, groupInfo->maxMemorySpaceUsed);
 		}
 
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
-
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
-			/* Add a semicolon separator only if memory stats were printed. */
-			if (groupInfo->maxMemorySpaceUsed > 0)
-				appendStringInfo(es->str, ";");
-			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+			appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
 							 spaceTypeName, avgSpace,
-							 groupInfo->maxDiskSpaceUsed);
+							 spaceTypeName, groupInfo->maxDiskSpaceUsed);
 		}
 	}
 	else
@@ -2899,7 +2895,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		if (prefixsortGroupInfo->groupCount > 0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, " ");
+				appendStringInfo(es->str, "  ");
 			show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
 		}
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -2943,7 +2939,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			if (prefixsortGroupInfo->groupCount > 0)
 			{
 				if (es->format == EXPLAIN_FORMAT_TEXT)
-					appendStringInfo(es->str, " ");
+					appendStringInfo(es->str, "  ");
 				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
 			}
 			if (es->format == EXPLAIN_FORMAT_TEXT)
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 238d89a206..cf157a7aa1 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -533,13 +533,13 @@ select * from (select * from t order by a) s order by a, b limit 55;
 
 -- Test EXPLAIN ANALYZE with only a fullsort group.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
-                                 explain_analyze_without_memory                                 
-------------------------------------------------------------------------------------------------
+                                        explain_analyze_without_memory                                         
+---------------------------------------------------------------------------------------------------------------
  Limit (actual rows=55 loops=1)
    ->  Incremental Sort (actual rows=55 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         Full-sort Groups: 2  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
          ->  Sort (actual rows=101 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
@@ -708,13 +708,13 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                                                                    explain_analyze_without_memory                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+                                                                                   explain_analyze_without_memory                                                                                    
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB  Presorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
          ->  Sort (actual rows=1000 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
-- 
2.17.0

v1-0002-Put-Pre-sorted-groups-on-a-separate-line.patchtext/x-diff; charset=us-asciiDownload
From c8c4691e66d72c847d24ab547afa96f30fec1870 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat, 18 Apr 2020 21:02:59 -0500
Subject: [PATCH v1 2/3] Put Pre-sorted groups on a separate line

---
 src/backend/commands/explain.c                 | 10 ++++++++--
 src/test/regress/expected/incremental_sort.out |  9 +++++----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9257c52707..2ec5d5b810 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2895,7 +2895,10 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		if (prefixsortGroupInfo->groupCount > 0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, "  ");
+			{
+				appendStringInfo(es->str, "\n");
+				ExplainIndentText(es);
+			}
 			show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
 		}
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -2939,7 +2942,10 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			if (prefixsortGroupInfo->groupCount > 0)
 			{
 				if (es->format == EXPLAIN_FORMAT_TEXT)
-					appendStringInfo(es->str, "  ");
+				{
+					appendStringInfo(es->str, "\n");
+					ExplainIndentText(es);
+				}
 				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
 			}
 			if (es->format == EXPLAIN_FORMAT_TEXT)
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index cf157a7aa1..3460d2bd6f 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -708,18 +708,19 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                                                                                   explain_analyze_without_memory                                                                                    
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+                                        explain_analyze_without_memory                                         
+---------------------------------------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB  Presorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
+         Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB
+         Presorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
          ->  Sort (actual rows=1000 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
                ->  Seq Scan on t (actual rows=1000 loops=1)
-(9 rows)
+(10 rows)
 
 select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
                   jsonb_pretty                   
-- 
2.17.0

v1-0003-comment-typos-Incremental-Sort.patchtext/x-diff; charset=us-asciiDownload
From f18a0d1f28c0ab8b9cd7e33ce7445830faa6e20d Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Mon, 6 Apr 2020 17:37:31 -0500
Subject: [PATCH v1 3/3] comment typos: Incremental Sort

commit d2d8a229bc58a2014dce1c7a4fcdb6c5ab9fb8da
Author: Tomas Vondra <tomas.vondra@postgresql.org>

Previously reported here:
https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com
---
 src/backend/commands/explain.c             | 4 ++--
 src/backend/executor/nodeIncrementalSort.c | 8 +++++---
 src/backend/utils/sort/tuplesort.c         | 8 ++++----
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2ec5d5b810..466666635b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2865,7 +2865,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 }
 
 /*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
  */
 static void
 show_incremental_sort_info(IncrementalSortState *incrsortstate,
@@ -2916,7 +2916,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			&incrsortstate->shared_info->sinfo[n];
 
 			/*
-			 * If a worker hasn't process any sort groups at all, then exclude
+			 * If a worker hasn't processed any sort groups at all, then exclude
 			 * it from output since it either didn't launch or didn't
 			 * contribute anything meaningful.
 			 */
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 39ba11cdf7..da99453c91 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -987,7 +987,7 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 
 	/*
 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
-	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only ???? one of many sort
 	 * batches in the current sort state.
 	 */
 	Assert((eflags & (EXEC_FLAG_BACKWARD |
@@ -1153,8 +1153,10 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	/*
 	 * If we've set up either of the sort states yet, we need to reset them.
 	 * We could end them and null out the pointers, but there's no reason to
-	 * repay the setup cost, and because guard setting up pivot comparator
-	 * state similarly, doing so might actually cause a leak.
+	 * repay the setup cost, and because ExecIncrementalSort guards
+	 * presorted column functions by checking to see if the full sort state
+	 * has been initialized yet, setting the sort states to null here might
+	 * actually cause a leak.
 	 */
 	if (node->fullsort_state != NULL)
 	{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index de38c6c7e0..c25a22f79b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1428,11 +1428,11 @@ tuplesort_updatemax(Tuplesortstate *state)
 	}
 
 	/*
-	 * Sort evicts data to the disk when it didn't manage to fit those data to
-	 * the main memory.  This is why we assume space used on the disk to be
+	 * Sort evicts data to the disk when it didn't fit data in
+	 * main memory.  This is why we assume space used on the disk to be
 	 * more important for tracking resource usage than space used in memory.
-	 * Note that amount of space occupied by some tuple set on the disk might
-	 * be less than amount of space occupied by the same tuple set in the
+	 * Note that the amount of space occupied by some tupleset on the disk might
+	 * be less than amount of space occupied by the same tupleset in
 	 * memory due to more compact representation.
 	 */
 	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
-- 
2.17.0

#346James Coleman
jtc331@gmail.com
In reply to: Justin Pryzby (#345)
2 attachment(s)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, Apr 18, 2020 at 10:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted

...

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing. I think let's wait until that's settled before
addressing this more complex and even newer case. We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

The change proposed on the WAL thread is minimal, and makes new explain(WAL)
output consistent with that of explain(BUFFERS).

That uses a different format from "Sort", which is what incremental sort should
follow. (Hashjoin also uses the Sort's format of two-spaces and colons rather
than equals).

I think it's not great that buffers/sort are different, but I agree
that we should follow sort.

So the attached 0001 makes explain output for incremental sort more consistent
with sort:

- Two spaces;
- colons rather than equals;
- Don't use semicolon, which isn't in use anywhere else;

I tested with this:
template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a; CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;
template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
...
Full-sort Groups: 1000 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB Pre-sorted Groups: 1000 Sort Method: quicksort Average Memory: 30kB Peak Memory: 30kB

On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

I still think Pre-sorted groups should be on a separate line, as in 0002.
In addition to looking better (to me), and being easier to read, another reason
is that there are essentially key=>values here, but the keys are repeated (Sort
Method, etc).

I collapsed this into 0001 because I think that if we're going to do
away with the key=value style then we effectively have to do this
to avoid the repeated values being confusing (with key=value it kinda
worked, because that made it seem like the avg/peak were clearly a
subset of the Sort Groups info).

I also cleaned up the newline patch a bit in the process (we already
have a way to indent with a flag so don't need to do it directly).

I also suggested to rename: s/Presorted/Pre-sorted/, but I didn't do that here.

I went ahead and did that too; we already use "Full-sort", so the
proposed change makes both parallel.

Michael already patched most of the comment typos; the remainder I'm including
here as a "nearby patch"..

Modified slightly.

James

Attachments:

v2-0002-comment-typos-Incremental-Sort.patchtext/x-patch; charset=US-ASCII; name=v2-0002-comment-typos-Incremental-Sort.patchDownload
From becd60ba348558fa241db5cc2100a84b757cbdc5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Mon, 6 Apr 2020 17:37:31 -0500
Subject: [PATCH v2 2/2] comment typos: Incremental Sort

commit d2d8a229bc58a2014dce1c7a4fcdb6c5ab9fb8da
Author: Tomas Vondra <tomas.vondra@postgresql.org>

Previously reported here:
https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com
---
 src/backend/commands/explain.c             |  4 ++--
 src/backend/executor/nodeIncrementalSort.c | 10 ++++++----
 src/backend/utils/sort/tuplesort.c         |  8 ++++----
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5f91c569a0..86c10458f4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2865,7 +2865,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 }
 
 /*
- * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
  */
 static void
 show_incremental_sort_info(IncrementalSortState *incrsortstate,
@@ -2913,7 +2913,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			&incrsortstate->shared_info->sinfo[n];
 
 			/*
-			 * If a worker hasn't process any sort groups at all, then exclude
+			 * If a worker hasn't processed any sort groups at all, then exclude
 			 * it from output since it either didn't launch or didn't
 			 * contribute anything meaningful.
 			 */
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 39ba11cdf7..05c60ec3e0 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -987,8 +987,8 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
 
 	/*
 	 * Incremental sort can't be used with either EXEC_FLAG_REWIND,
-	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
-	 * batches in the current sort state.
+	 * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because the current sort state
+	 * contains only one sort batch rather than the full result set.
 	 */
 	Assert((eflags & (EXEC_FLAG_BACKWARD |
 					  EXEC_FLAG_MARK)) == 0);
@@ -1153,8 +1153,10 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
 	/*
 	 * If we've set up either of the sort states yet, we need to reset them.
 	 * We could end them and null out the pointers, but there's no reason to
-	 * repay the setup cost, and because guard setting up pivot comparator
-	 * state similarly, doing so might actually cause a leak.
+	 * repay the setup cost, and because ExecIncrementalSort guards
+	 * presorted column functions by checking to see if the full sort state
+	 * has been initialized yet, setting the sort states to null here might
+	 * actually cause a leak.
 	 */
 	if (node->fullsort_state != NULL)
 	{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index de38c6c7e0..d59e3d5a8d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1428,11 +1428,11 @@ tuplesort_updatemax(Tuplesortstate *state)
 	}
 
 	/*
-	 * Sort evicts data to the disk when it didn't manage to fit those data to
-	 * the main memory.  This is why we assume space used on the disk to be
+	 * Sort evicts data to the disk when it wasn't able to fit that data into
+	 * main memory.  This is why we assume space used on the disk to be
 	 * more important for tracking resource usage than space used in memory.
-	 * Note that amount of space occupied by some tuple set on the disk might
-	 * be less than amount of space occupied by the same tuple set in the
+	 * Note that the amount of space occupied by some tupleset on the disk might
+	 * be less than amount of space occupied by the same tupleset in
 	 * memory due to more compact representation.
 	 */
 	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
-- 
2.17.1

v2-0001-Fix-explain-output-for-incr-sort.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Fix-explain-output-for-incr-sort.patchDownload
From 8e80be2b345c3940f76ffbd5e3c201a7ae855784 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Wed, 15 Apr 2020 08:45:21 -0500
Subject: [PATCH v2 1/2] Fix explain output for incr sort:

 - Two spaces
 - colons rather than equals
 - Don't use semicolon
 - Put Pre-sorted groups on a separate line
 - Rename Presorted to Pre-sorted (to match Full-sort)
---
 src/backend/commands/explain.c                | 22 ++++++++-----------
 .../regress/expected/incremental_sort.out     | 21 +++++++++---------
 src/test/regress/sql/incremental_sort.sql     |  4 ++--
 3 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7ae6131676..5f91c569a0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2778,7 +2778,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 	{
 		if (indent)
 			appendStringInfoSpaces(es->str, es->indent * 2);
-		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT " Sort Method", groupLabel,
+		appendStringInfo(es->str, "%s Groups: " INT64_FORMAT "  Sort Method", groupLabel,
 						 groupInfo->groupCount);
 		/* plural/singular based on methodNames size */
 		if (list_length(methodNames) > 1)
@@ -2798,24 +2798,20 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
-			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+			appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
 							 spaceTypeName, avgSpace,
-							 groupInfo->maxMemorySpaceUsed);
+							 spaceTypeName, groupInfo->maxMemorySpaceUsed);
 		}
 
 		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
-
 			const char *spaceTypeName;
 
 			spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
-			/* Add a semicolon separator only if memory stats were printed. */
-			if (groupInfo->maxMemorySpaceUsed > 0)
-				appendStringInfo(es->str, ";");
-			appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
+			appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
 							 spaceTypeName, avgSpace,
-							 groupInfo->maxDiskSpaceUsed);
+							 spaceTypeName, groupInfo->maxDiskSpaceUsed);
 		}
 	}
 	else
@@ -2899,8 +2895,8 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 		if (prefixsortGroupInfo->groupCount > 0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, " ");
-			show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+				appendStringInfo(es->str, "\n");
+			show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", true, es);
 		}
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 			appendStringInfo(es->str, "\n");
@@ -2943,8 +2939,8 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
 			if (prefixsortGroupInfo->groupCount > 0)
 			{
 				if (es->format == EXPLAIN_FORMAT_TEXT)
-					appendStringInfo(es->str, " ");
-				show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
+					appendStringInfo(es->str, "\n");
+				show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", true, es);
 			}
 			if (es->format == EXPLAIN_FORMAT_TEXT)
 				appendStringInfo(es->str, "\n");
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 238d89a206..2b40a26d82 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -106,7 +106,7 @@ declare
   space_key text;
 begin
   for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
-    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
       for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
         node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
         node := jsonb_set(node, array[group_key, space_key, 'Peak Sort Space Used'], '"NN"', false);
@@ -128,7 +128,7 @@ declare
   space_key text;
 begin
   for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
-    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
       group_stats := node->group_key;
       for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
         if (group_stats->space_key->'Peak Sort Space Used')::bigint < (group_stats->space_key->'Peak Sort Space Used')::bigint then
@@ -533,13 +533,13 @@ select * from (select * from t order by a) s order by a, b limit 55;
 
 -- Test EXPLAIN ANALYZE with only a fullsort group.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
-                                 explain_analyze_without_memory                                 
-------------------------------------------------------------------------------------------------
+                                        explain_analyze_without_memory                                         
+---------------------------------------------------------------------------------------------------------------
  Limit (actual rows=55 loops=1)
    ->  Incremental Sort (actual rows=55 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         Full-sort Groups: 2  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
          ->  Sort (actual rows=101 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
@@ -708,18 +708,19 @@ select * from t left join (select * from (select * from t order by a) v order by
 rollback;
 -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
 select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
-                                                                    explain_analyze_without_memory                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+                                         explain_analyze_without_memory                                         
+----------------------------------------------------------------------------------------------------------------
  Limit (actual rows=70 loops=1)
    ->  Incremental Sort (actual rows=70 loops=1)
          Sort Key: t.a, t.b
          Presorted Key: t.a
-         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
+         Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB
+         Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
          ->  Sort (actual rows=1000 loops=1)
                Sort Key: t.a
                Sort Method: quicksort  Memory: NNkB
                ->  Seq Scan on t (actual rows=1000 loops=1)
-(9 rows)
+(10 rows)
 
 select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
                   jsonb_pretty                   
@@ -747,7 +748,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
                  "Average Sort Space Used": "NN"+
              }                                  +
          },                                     +
-         "Presorted Groups": {                  +
+         "Pre-sorted Groups": {                 +
              "Group Count": 5,                  +
              "Sort Methods Used": [             +
                  "top-N heapsort",              +
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index 2241cc9c02..6f70b36a81 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -82,7 +82,7 @@ declare
   space_key text;
 begin
   for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
-    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
       for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
         node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
         node := jsonb_set(node, array[group_key, space_key, 'Peak Sort Space Used'], '"NN"', false);
@@ -105,7 +105,7 @@ declare
   space_key text;
 begin
   for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
-    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
+    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
       group_stats := node->group_key;
       for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
         if (group_stats->space_key->'Peak Sort Space Used')::bigint < (group_stats->space_key->'Peak Sort Space Used')::bigint then
-- 
2.17.1

#347Justin Pryzby
pryzby@telsasoft.com
In reply to: James Coleman (#346)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Checking if it's possible to address this Open Item before 13b1.

https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort)

On Sun, Apr 19, 2020 at 09:46:55AM -0400, James Coleman wrote:

On Sat, Apr 18, 2020 at 10:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted

...

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing. I think let's wait until that's settled before
addressing this more complex and even newer case. We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

The change proposed on the WAL thread is minimal, and makes new explain(WAL)
output consistent with that of explain(BUFFERS).

That uses a different format from "Sort", which is what incremental sort should
follow. (Hashjoin also uses the Sort's format of two-spaces and colons rather
than equals).

I think it's not great that buffers/sort are different, but I agree
that we should follow sort.

So the attached 0001 makes explain output for incremental sort more consistent
with sort:

- Two spaces;
- colons rather than equals;
- Don't use semicolon, which isn't in use anywhere else;

I tested with this:
template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a; CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;
template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
...
Full-sort Groups: 1000 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB Pre-sorted Groups: 1000 Sort Method: quicksort Average Memory: 30kB Peak Memory: 30kB

On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

I still think Pre-sorted groups should be on a separate line, as in 0002.
In addition to looking better (to me), and being easier to read, another reason
is that there are essentially key=>values here, but the keys are repeated (Sort
Method, etc).

I collapsed this into 0001 because I think that if we're going to do
away with the key=value style then we effectively have to do this
to avoid the repeated values being confusing (with key=value it kinda
worked, because that made it seem like the avg/peak were clearly a
subset of the Sort Groups info).

I also cleaned up the newline patch a bit in the process (we already
have a way to indent with a flag so don't need to do it directly).

I also suggested to rename: s/Presorted/Pre-sorted/, but I didn't do that here.

I went ahead and did that too; we already use "Full-sort", so the
proposed change makes both parallel.

Michael already patched most of the comment typos; the remainder I'm including
here as a "nearby patch"..

Modified slightly.

James


--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581

#348Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#347)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, May 09, 2020 at 03:18:36PM -0500, Justin Pryzby wrote:

Checking if it's possible to address this Opened Item before 13b1.

https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort)

Yes. Now that the other items related to incremental sort are fixed,
this is next on my TODO.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#349Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#348)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, May 09, 2020 at 10:50:12PM +0200, Tomas Vondra wrote:

On Sat, May 09, 2020 at 03:18:36PM -0500, Justin Pryzby wrote:

Checking if it's possible to address this Opened Item before 13b1.

https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort)

Yes. Now that the other items related to incremental sort are fixed,
this is next on my TODO.

OK, so aside from the typo/comment fixes, the proposed changes to the
explain format are:

- two spaces to separate groups of related values
- use colons rather than equals for fields
- don't use semicolons
- split full-groups and prefix-groups to separate lines
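
In text format, that amounts to the following before/after (illustrative;
the figures are taken from the examples earlier in the thread):

  before: Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

  after:  Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB
          Pre-sorted Groups: 1  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB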

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#350Peter Geoghegan
In reply to: Tomas Vondra (#349)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).

I see something about starting a new thread on the Open Items page.
Please CC me on this.

Speaking in my capacity as an RMT member: Glad to see that you plan to
close this item out next week. (I had planned on giving you a nudge
about this, but it looks like I don't really have to now.)

Thanks
--
Peter Geoghegan

#351Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#350)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sat, May 09, 2020 at 06:48:02PM -0700, Peter Geoghegan wrote:

On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).

I see something about starting a new thread on the Open Items page.
Please CC me on this.

Speaking in my capacity as an RMT member: Glad to see that you plan to
close this item out next week. (I had planned on giving you a nudge
about this, but it looks like I don't really have to now.)

Not sure about the new thread - the discussion continues on the
main incremental sort thread, I don't see any proposal to start a new
thread there. IMO it'd be pointless at this point.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#352Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#351)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Sun, May 10, 2020 at 02:25:23PM +0200, Tomas Vondra wrote:

On Sat, May 09, 2020 at 06:48:02PM -0700, Peter Geoghegan wrote:

On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).

I see something about starting a new thread on the Open Items page.
Please CC me on this.

Speaking in my capacity as an RMT member: Glad to see that you plan to
close this item out next week. (I had planned on giving you a nudge
about this, but it looks like I don't really have to now.)

Not sure about the new thread - the discussion continues on the
main incremental sort thread, I don't see any proposal to start a new
thread there. IMO it'd be pointless at this point.

I've pushed both patches, fixing typos and explain format.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#353Peter Geoghegan
In reply to: Tomas Vondra (#352)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, May 12, 2020 at 11:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I've pushed both patches, fixing typos and explain format.

Thanks, Tomas.

--
Peter Geoghegan

#354Justin Pryzby
pryzby@telsasoft.com
In reply to: James Coleman (#321)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

The change I proposed was like:

@@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,

 		ExplainPropertyList("Sort Methods Used", methodNames, es);

-		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
...
-		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
...

To show in non-text format *both* disk and memory space used, even if zero.

I still think that's what's desirable.

If it's important to show *whether* a sort space was used, then I think there
should be a boolean, or an int 0/1. But I don't think it's actually needed,
since someone parsing the explain output could just check
|if _dict['Peak Sort Space Used'] > 0: ...
the same as you're doing, without having to write some variation on:
|if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...
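
For illustration, a minimal sketch of what that would look like in
show_incremental_sort_group_info(), assuming a helper is factored out and
using the field names from the regression tests (hypothetical, not the
committed code):

/*
 * Sketch only: emit a space-usage group unconditionally, so the
 * machine-readable formats always carry both memory and disk stats
 * (reporting zero when a space type wasn't used).
 */
static void
show_sort_space_group(const char *spaceTypeName, long totalSpaceUsed,
					  long maxSpaceUsed, int64 groupCount, ExplainState *es)
{
	StringInfoData groupName;

	initStringInfo(&groupName);
	appendStringInfo(&groupName, "Sort Space %s", spaceTypeName);

	ExplainOpenGroup("Sort Space", groupName.data, true, es);
	ExplainPropertyInteger("Average Sort Space Used", "kB",
						   groupCount > 0 ? totalSpaceUsed / groupCount : 0,
						   es);
	ExplainPropertyInteger("Peak Sort Space Used", "kB", maxSpaceUsed, es);
	ExplainCloseGroup("Sort Space", groupName.data, true, es);

	pfree(groupName.data);
}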

--
Justin

#355James Coleman
jtc331@gmail.com
In reply to: Justin Pryzby (#354)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Fri, Jun 19, 2020 at 12:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:

On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:

I've pushed the first part of this patch series - I've reorganized it a

I scanned through this again post-commit. Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

The change I proposed was like:

@@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,

 		ExplainPropertyList("Sort Methods Used", methodNames, es);

-		if (groupInfo->maxMemorySpaceUsed > 0)
 		{
 			long		avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
 			const char *spaceTypeName;
...
-		if (groupInfo->maxDiskSpaceUsed > 0)
 		{
...

To show in non-text format *both* disk and memory space used, even if zero.

I still think that's what's desirable.

I'm of the opposite opinion. I believe showing both unnecessarily is confusing.

If it's important to show *whether* a sort space was used, then I think there
should be a boolean, or an int 0/1. But I don't think it's actually needed,
since someone parsing the explain output could just check
|if _dict['Peak Sort Space Used'] > 0: ...
the same as you're doing, without having to write some variation on:
|if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...

I think it's desirable for code to be explicit about the type having
been used rather than implicitly assuming it based on 0/non-zero
values.

James

#356James Coleman
jtc331@gmail.com
In reply to: James Coleman (#355)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

It seems like the consensus over at another discussion on this topic
[1] is that we ought to go ahead and print the zeros [for machine
readable output formats], even though that creates some interesting
scenarios like the fact that disk sorts will print 0 for memory even
though that's not true.

The change has already been made and pushed for hash disk spilling, so
I think we ought to use Justin's patch here.

James

[1]: /messages/by-id/2276865.1593102811@sss.pgh.pa.us

#357Jonathan S. Katz
jkatz@postgresql.org
In reply to: James Coleman (#356)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On 7/2/20 11:47 AM, James Coleman wrote:

It seems like the consensus over at another discussion on this topic
[1] is that we ought to go ahead and print the zeros [for machine
readable output formats], even though that creates some interesting
scenarios like the fact that disk sorts will print 0 for memory even
though that's not true.

The change has already been made and pushed for hash disk spilling, so
I think we ought to use Justin's patch here.

Do people agree with James's analysis? From the RMT perspective, we would
like to get this open item wrapped up for the next beta, given [1] is now
resolved.

Thanks!

Jonathan

[1]: /messages/by-id/CAApHDvpBQx4Shmisjp7oKr=ECX18KYKPB=KpdWYxMKQNvisgvQ@mail.gmail.com

#358Peter Geoghegan
In reply to: Jonathan S. Katz (#357)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 9, 2020 at 12:06 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:

On 7/2/20 11:47 AM, James Coleman wrote:

It seems like the consensus over at another discussion on this topic
[1] is that we ought to go ahead and print the zeros [for machine
readable output formats], even though that creates some interesting
scenarios like the fact that disk sorts will print 0 for memory even
though that's not true.

The change has already been made and pushed for hash disk spilling, so
I think we ought to use Justin's patch here.

Do people agree with James's analysis? From the RMT perspective, we would
like to get this open item wrapped up for the next beta, given [1] is now
resolved.

Tomas, Justin: Ping? Can we get an update on this?

Just for the record, David Rowley fixed the similar hashagg issue in
commit 40efbf8706cdd96e06bc4d1754272e46d9857875. I don't see any
reason for the delay here.

Thanks
--
Peter Geoghegan

#359Peter Geoghegan
In reply to: James Coleman (#356)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)

On Thu, Jul 2, 2020 at 8:47 AM James Coleman <jtc331@gmail.com> wrote:

It seems like the consensus over at another discussion on this topic
[1] is that we ought to go ahead and print the zeros [for machine
readable output formats], even though that creates some interesting
scenarios like the fact that disk sorts will print 0 for memory even
though that's not true.

What about having it print -1 for memory in this case instead? That's
still not ideal, but machine readable EXPLAIN output ought to
consistently show the same information per node, even when the answer
is in some sense indeterminate. That seems to be the standard that
we've settled on.

It might be worth teaching the JSON format to show a JSON null or
something instead. Not sure about that, though.
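
For concreteness, a minimal sketch of the -1 variant in the memory branch of
show_incremental_sort_group_info() (hypothetical, not a committed change):

	/*
	 * Hypothetical: always emit the field in machine-readable formats,
	 * with -1 meaning "this space type was not used", instead of an
	 * ambiguous zero.
	 */
	ExplainPropertyInteger("Peak Sort Space Used", "kB",
						   groupInfo->maxMemorySpaceUsed > 0 ?
						   groupInfo->maxMemorySpaceUsed : -1,
						   es);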

--
Peter Geoghegan